Summary:
Currently, if watchdog + healthcheck are enabled via knobs but watchdog is disabled via SJD config, we observe a hang when the watchdog loop attempts to open the watchdog file path. This is because the FileTimerClient that is usually set in TorchElasticWatchdog will not be set, since disabling watchdog via SJD config bypasses the TorchElasticWatchdog initialization.
The workaround is to update the healthcheck time when calling `get_last_progress_time`
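A minimal sketch of the shape of the workaround (class and attribute names are hypothetical, not the actual TorchElastic implementation):
```python
import time


class HealthcheckProgress:
    """Hypothetical sketch: progress normally comes from a FileTimerClient that
    TorchElasticWatchdog sets up; when watchdog is disabled via SJD config,
    that client is never set."""

    def __init__(self):
        self._file_timer_client = None  # never set when watchdog is SJD-disabled
        self._last_progress_time = int(time.time())

    def get_last_progress_time(self) -> int:
        if self._file_timer_client is None:
            # Workaround: without a client, refresh the timestamp here so the
            # healthcheck never sees a stale value and hangs on the watchdog path.
            self._last_progress_time = int(time.time())
        return self._last_progress_time
```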
Test Plan:
Logs show that the progress time value is being updated despite the client not being set
Behavior when watchdog is enabled with SJD config is left unchanged
Differential Revision: D64733766
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138615
Approved by: https://github.com/gag1jain
Summary: Currently, calling `torch._logging.set_logs()` resets the log directory, leading to multiple tlparse outputs. This PR prevents the directory from resetting after the first call.
Reviewed By: ezyang
Differential Revision: D64118047
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137793
Approved by: https://github.com/ezyang
Adds the job `get-test-label-type` in `.github/workflows/inductor-perf-compare.yml` checking for the experiment `awsa100`.
It is then used by the job `linux-focal-cuda12_1-py3_10-gcc9-inductor-build` to define the prefix for the runners that will run the benchmark.
Those runners temporarily accept the labels `awsa100.linux.gcp.a100` and `linux.aws.a100`. This is done so we can migrate via experimentation from `linux.gcp.a100`. After successfully experimenting with those instances, we will remove those labels, update the workflows to use `linux.aws.a100`, and decommission the GCP fleet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138204
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
### Why use non-blocking mode in eager init?
For overlapping comm init and model init, etc.

### Why can we set non-blocking as default?
If the setting is dangling -- i.e. not passed in by the user nor set via env -- `ProcessGroupNCCL` can apply its preferred logic. And the torch-level API semantics do not change whether the NCCL comm is blocking or non-blocking (that is handled within `ProcessGroupNCCL`).
### Why not make non-blocking default for lazy mode as well?
PR https://github.com/pytorch/pytorch/pull/137544 tried it.
Two reasons why that's not preferred today:
1. It is hard -- too big a blast radius.
2. There is no gain from doing lazy init in non-blocking mode, because the very next CPU call is a collective, and we will block there waiting for the comm to be ready -- the same effect as a blocking init, with no overlap "opening" compared to eager mode.
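For reference, a hedged sketch of the eager-init pattern this applies to (`build_model` is a placeholder for real model construction; the non-blocking behavior itself is handled inside `ProcessGroupNCCL`):
```python
import torch
import torch.distributed as dist


def build_model():
    # placeholder for real model-construction work
    return torch.nn.Linear(8, 8).cuda()


# Passing device_id triggers eager communicator creation at init time; with a
# non-blocking NCCL init, comm setup can overlap with the CPU-side model
# construction that follows, with no change to the torch-level API.
dist.init_process_group("nccl", device_id=torch.device("cuda", 0))
model = build_model()
```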
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527
Approved by: https://github.com/wconstab
ghstack dependencies: #137855, #138488, #138374, #138384
Summary:
This diff fixes a deadlock in the ExecutionTraceObserver. ExecutionTraceObserver acquires a lock in recordOperatorStart and onFunctionExit. However, inside these two functions the input/output values are evaluated, which can acquire the Python GIL in some use cases. In this case, the lock order is: ET lock -> GIL.
One of the ads applications acquires the GIL first, then calls all-gather to collect some metrics from all ranks. When ET is on, the all-gather is captured by the ET observer. In this case, the lock order is: GIL -> ET lock.
That is why the deadlock happens. To fix it, I changed the scope of the ET lock so that the input/output evaluation is no longer inside it.
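A minimal Python illustration of the lock-order inversion described above (the real code is C++; `et_lock` and `gil` here just stand in for the ET observer lock and the GIL):
```python
import threading

et_lock = threading.Lock()   # stands in for the ExecutionTraceObserver lock
gil = threading.Lock()       # stands in for the Python GIL


def observer_callback():
    with et_lock:            # old code: held while evaluating inputs/outputs...
        with gil:            # ...which may need the GIL -> order: ET lock -> GIL
            pass


def app_all_gather():
    with gil:                # application already holds the GIL...
        with et_lock:        # ...and the captured all-gather needs the ET lock
            pass             # -> order: GIL -> ET lock => potential deadlock

# The fix narrows the ET lock scope so the input/output evaluation (the
# GIL-taking part) happens outside the ET lock, removing the inverted ordering.
```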
Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda
Differential Revision: D63556608
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136892
Approved by: https://github.com/aaronenyeshi
As discussed with @ezyang, this set of diffs extracts fixes for problems discovered when flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the global CI as the primary test plan. These code paths are all tested via existing tests when `specialize_float=False`, and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138599
Approved by: https://github.com/ezyang
As discussed with @ezyang, this set of diffs extracts fixes for problems discovered when flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the global CI as the primary test plan. These code paths are all tested via existing tests when `specialize_float=False`, and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138595
Approved by: https://github.com/ezyang
Prior to this PR, when a model failed to export, we fell back to the legacy TorchScript exporter. However, we didn't stop there: even after the model was exported with the TorchScript exporter, an optimization pass was still applied to the graph.
Ideally the optimization would also boost the performance of models exported with the legacy TorchScript exporter, but for now, for benchmarking purposes and to honor the fallback guarantee to users, we should keep it simple and only return the graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138265
Approved by: https://github.com/xadupre, https://github.com/justinchuby
Previously we only waited for the comm to become ready after its initialization.
That's not enough. Other NCCL APIs can also leave the comm InProgress, e.g. P2P calls, commSplit, commFinalize, etc.
Therefore, we now ensure the comm is ready every "next time" we need to access the ncclComm.
The place to add such a gatekeeper is `getNcclComm`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138384
Approved by: https://github.com/shuqiangzhang, https://github.com/fduwjj
ghstack dependencies: #137855, #138488, #138374
Summary:
This PR adds a lowering for `torch._cslt_sparse_mm` to find the optimal
alg_id and cache it when running with `torch.compile`
Seeing speedups on both bfloat16 and float8 dtypes:
<img width="641" alt="Screenshot 2024-10-17 at 2 10 38 PM" src="https://github.com/user-attachments/assets/b928cd11-32a3-43e5-b209-8e4028896f0b">
<img width="1274" alt="Screenshot 2024-10-17 at 1 39 03 PM" src="https://github.com/user-attachments/assets/d9edd684-a8ec-46fd-b3da-2e76dbcb7bb6">
* `torch._cslt_sparse_mm_search` has been modified to return optimal
split-k parameters as well as max alg_id.
* max_id is now available in `torch.backends.cusparselt` via
`torch.backends.cusparselt.get_max_alg_id()`
* fixed meta registrations for float8
Test Plan:
python test/test_sparse_semi_structured.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch
This diff does a few things:
## Add metadata to events in progress
Adds the ability to add extra metadata to Chromium Events via `add_event_data`.
Metadata can only be added to chromium events that have started, but not ended (so, in progress events)
- When you add the data, the metadata is appended to the event when you call log_event_end().
- The metadata appears in chromium events in tlparse. It also gets logged to scuba.
## New `dynamo` chromium event
We add a new `dynamo` chromium event to the top of the stack, where we collect various metadata found in dynamo_compile. So the new order of events goes:
```
__start__
-> dynamo (dynamo compile metrics)
-> entire_frame_compile (compile.inner)
-> backend_compile (i.e. aotdispatch)
-> create_aot_dispatch_function
-> inductor_compile
-> ...
```
BackwardCompilationMetrics doesn't have any dynamo specific information (as it's mostly inductor timings). So we don't include that here.
*FAQ: Why can't we use `entire_frame_compile` as the event?*
This is mostly due to backward compatibility with `dynamo_compile`. `dynamo_compile` collects CompilationMetrics outside of `compile.compile_inner`, and uses `dynamo_timed` to grab timings from phases of the compiler, including `entire_frame_compile`. So we don't have a CompilationMetric object until after an `entire_frame_compile` event ends! Separately, `dynamo` as a name for all of dynamo compile is more descriptive than `entire_frame_compile`, imo.
## Log metadata as separate columns
(Meta only): Separately, this also changes the `metadata` column in PT2 Compile Events. Instead of logging a single metadata column in JSON, it separates the JSON into separate columns. This is much better for data analysis. Now that this table is more mature, I think logging keys to separate columns is a better system.
Differential Revision: [D64696287](https://our.internmc.facebook.com/intern/diff/D64696287/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D64696287/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138477
Approved by: https://github.com/aorenste
This PR fixes an issue with `torch._dynamo.assume_constant_result` causing global values to be overwritten.
Currently `torch._dynamo.assume_constant_result` saves the constant result into a global variable derived from the name of the function. This causes that function to be overwritten in the global scope. This PR checks that the name is unique in the global scope as well, avoiding the issue of overriding the function.
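For context, a small usage sketch of the API in question (illustrative only; the exact name-mangling of the generated global is not shown):
```python
import torch


@torch._dynamo.assume_constant_result
def get_scale():
    return 2.0


@torch.compile
def fn(x):
    # get_scale() is baked in as a constant; its result used to be stashed in a
    # global whose name was derived from the function name, clobbering get_scale
    # itself. The fix makes the generated global name unique.
    return x * get_scale()


fn(torch.randn(4))
```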
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132431
Approved by: https://github.com/jansel
Summary: We expand the tests to cover retraceability_non_strict. Currently failing tests are skipped.
Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test:test_export -- -r _retraceability
```
Differential Revision: D64611532
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138380
Approved by: https://github.com/angelayi
Before this PR, NJT would dispatch e.g. `NJT * nested_int` to `mul.Tensor`, wrongly interpreting the SymInt as a tensor and outputting garbage. This PR verifies that there are no nested ints in the list of args before dispatching for pointwise ops.
I originally tried checking that `the number of passed tensor args == the number of func schema tensor args`, but this wrongly disallows `nt * 2`, which (non-intuitively to me at least at first) dispatches via the `mul.Tensor` overload.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138602
Approved by: https://github.com/soulitzer
Was a bit confusing to read when working on #138354
"computer-assisted proof"
```
import random
from typing import List


def argsort(seq):
    # preserve original order for equal strides
    getter = seq.__getitem__
    a_r = range(len(seq))
    return list(reversed(sorted(a_r, key=getter, reverse=True)))  # noqa: C413


def stride_order2fill_order(order):
    """
    Convert stride order to fill order
    For channel last format,
    stride order = [3, 0, 2, 1] and fill order = [1, 3, 2, 0]
    """
    lookup = {pos: idx for idx, pos in enumerate(order)}
    fill_order = [lookup[i] for i in range(len(order))]
    return fill_order


def get_stride_order(seq):
    """
    Convert strides to stride order
    """
    sorted_idx: List[int] = argsort(seq)
    out = [0 for _ in range(len(seq))]
    for i, elem in enumerate(sorted_idx):
        out[elem] = i
    fillorder = stride_order2fill_order(out)
    assert fillorder == sorted_idx
    return out


for _ in range(1000):
    a = [0, 1, 2, 3]
    random.shuffle(a)
    get_stride_order(a)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138376
Approved by: https://github.com/drisspg
On ROCm, using a non-vectorized index_put kernel provides ~2x perf improvement over the hipified CUDA kernel. None of the existing unit tests were exercising the large index case so a new unit test was added.
It was also noted that the scale value in the original kernel was hard-coded to 1.0 which would be a no-op, so it was removed from the simplified rocm kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138259
Approved by: https://github.com/xw285cornell, https://github.com/leitian, https://github.com/eqy
Summary: The problem happened after splitting CppWrapperCpu and CppWrapperCpuArrayRef, because CppWrapperCpuArrayRef.generate_index_put_fallback missed a statement.
Running test_aot_inductor.py as a whole didn't reveal the problem, but running test_index_put_with_none_index_cpu_with_stack_allocation individually did. Digging deeper, the root cause is that init_backend_registration had incorrectly cached the CPU CppWrapperCodegen class, which means CppWrapperCpuArrayRef was never picked when running test_aot_inductor.py as a whole. To fix the problem, all the ArrayRef tests are split into a separate file. Also, a check is added to regex-match AOTInductorModelRunMinimalArrayrefInterface so this kind of false passing signal won't go unnoticed.
Differential Revision: [D64734106](https://our.internmc.facebook.com/intern/diff/D64734106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138541
Approved by: https://github.com/frank-wei
This PR combines a number of cleanups in one PR. If any of the specific cleanups don't seem to make sense, let me know and I can remove them.
Cleanups
- This PR adds a set of test suites for the config module code, which handles basically all the APIs and ways it is used. Please let me know if you see anything critical that is not tested that I missed. This test suite is primarily used as the regression test suite for later changes in this diff. Note that there is some dynamo specific testing of the config module, but it isn't as verbose.
- I removed all internal usage of shallow_copy_dict. Those usages could all use the deep copy, and did not depend on the reference behavior of certain config values that shallow_copy_dict allows.
- I removed shallow copy semantics for configuration with a deprecation warning. I think this requires a release note, so hopefully I did that correctly. Let me know if we want to continue to expose shallow copy value semantics, but I just can't find a case where I expect anyone would want it. It also complicated later internal changes to the API (i.e. breaking apart various layers of the config changes).
- I fixed what I believe is a bug in how hashes are calculated on configs. In particular, if you got the hash, then made a config change, and then got the hash again, it would not update the hash. @oulgen, please let me know if I'm misunderstanding this behavior and it is desired.
- I switched our multiple implementations of iterating through the dictionary to a single one. This is primarily to make later changes easier, but it also makes it clear how inconsistent our various config ignoring options are. Let me know if people would be interested in me unifying the various options for ignoring config values.
- I updated the test patcher (not the performance critical one, just the normal one) to use `__setattr__` and `__getattr__`, removing direct API access to the underlying config fetcher.
For release notes, I'm not sure exactly how to communicate this, but something like:
"ConfigModule.to_dict and ConfigModule.shallow_copy_dict no longer retain their shallow copy semantics, which allowed reference value objects to be modified. If you wish to modify the config object, call load_config explicitly."
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138377
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/jovianjaison
Previously we'd been raising UserErrors when `Dim()` and DimHints (`Dim.AUTO/Dim.DYNAMIC`) were both specified in `dynamic_shapes`; this PR stops that and uses `Dim()` objects to guide DimHints.
The key to this was making the `EqualityConstraint` class happy when it checks that inferred equivalence relations were specified in the original `dynamic_shapes` spec, and this introduces a `RelaxedConstraint` object to mark the hinted dimensions, so equality checks between `RelaxedConstraints` and other constraints are treated as valid.
Current behavior is that:
```
import torch
from torch.export import Dim, export


class Foo(torch.nn.Module):
    def forward(self, x, y):
        return x - y


inputs = (torch.randn(4, 4), torch.randn(4, 4))
shapes = {
    "x": (Dim.AUTO, Dim("d1", min=3)),
    "y": (Dim("d0", max=8), Dim.DYNAMIC),
}
ep = export(Foo(), inputs, dynamic_shapes=shapes)
```
The dimensions marked `AUTO` and `DYNAMIC` will have max & min ranges of 8 & 3 respectively. Note that inferred equality between `Dim()` objects & `Dim.STATIC` will still raise errors - `Dim()` suggests not specializing to a constant.
Differential Revision: D64636101
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138490
Approved by: https://github.com/avikchaudhuri
Summary:
This implementation does not utilize the benefit that after allgather we can directly perform the SDPA without doing the ring-based SDPA, but we can overlap the communication with the first sharded kv computation. This implementation shows some performance benefit and memory saving compared to the original alltoall implementation in certain cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132820
Approved by: https://github.com/XilunWu
This PR refactors some ref-counting functionality out of `beginAllocateToPool` and `releasePool`. The ref-counting logic is then used in construction and destruction of `torch.cuda.MemPool`.
The `use_count` variable in the CUDACachingAllocator is essentially a refcount of how many context managers are using the pool. Since we are now lifting the MemPool abstraction up to the user, the MemPool object itself now needs to hold an extra reference as well.
Part of https://github.com/pytorch/pytorch/issues/124807.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133600
Approved by: https://github.com/eqy, https://github.com/ezyang
Use `split_group()` to create sub_groups for nccl backend if the default pg is eagerly initialized. Otherwise, it will still go through the normal lazy init process and call `new_group()` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138129
Approved by: https://github.com/kwen2501
Currently, when tuple values are encountered in Dynamo, they are encoded using `repr(arg)`. This causes an issue if one of the values inside the tuple is not properly encoded by `repr`. In this case, if an enum is contained inside a tuple, invalid Python code is generated.
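A quick illustration of why `repr()` breaks here:
```python
import enum


class Color(enum.Enum):
    RED = 1


# repr() of a tuple containing an enum is not valid Python source, so codegen
# that splices it into generated code produces a syntax error:
print(repr((Color.RED, 42)))  # "(<Color.RED: 1>, 42)"
```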
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133123
Approved by: https://github.com/jansel
Summary: same as title. The plan is to pass a callable to the partitioner to perform custom autoAC via an ILP. This is the same as a previous diff, D63714905, which was landed and then subsequently reverted by PyTorch Release Engineering because of a failing unit test (f7b8d36c28). We think the unit test is buggy, and we fix it here as well.
Test Plan: tbd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137785
Approved by: https://github.com/basilwong
Co-authored-by: Huy Do <huydhn@gmail.com>
This has the benefit that:
1) It's much easier to aggregate test failure repros into, say, a CSV or shell script from Scuba.
2) We can do analysis (e.g. take the set difference of tests across two PRs).
3) We can get results faster at test-level granularity instead of the job-level granularity we see in the HUD/GH.
I tested this by introducing a breaking change, adding ci-scribe label and then verifying that the failed tests were logged to scuba: https://fburl.com/scuba/torch_open_source_signpost/w6qt7qr9
I then reverted the breaking change and published this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138394
Approved by: https://github.com/ezyang
Forward fix for build issue introduced by #137855:
```
In file included from fbcode/caffe2/torch/csrc/distributed/c10d/NCCLUtils.cpp:2:
fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp:508:21: error: use of undeclared identifier 'NCCL_SPLIT_NOCOLOR'
508 | int split_color{NCCL_SPLIT_NOCOLOR - 1};
| ^
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138488
Approved by: https://github.com/fduwjj
ghstack dependencies: #137855
Does what it says on the tin. I believe the right behavior here is to ensure that `record_stream()` is called on all tensor components of the NJT to ensure they all live until stream computation is complete.
This is an ask from torchrec as the op is used there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137099
Approved by: https://github.com/ngimel
Looking in the code I see
```
// NB: __cplusplus doesn't work for MSVC, so for now MSVC always uses
// the "__declspec(deprecated)" implementation and not the C++14
// "[[deprecated]]" attribute. We tried enabling "[[deprecated]]" for C++14 on
// MSVC, but ran into issues with some older MSVC versions.
```
But looking at the [MSVC C++ support table](https://learn.microsoft.com/en-us/cpp/overview/visual-cpp-language-conformance?view=msvc-170) I see that the `[[deprecated]]` attribute is supported as of MSVC 2015 and that the vast majority of C++17 features became supported in MSVC 2015 _or later_.
Since PyTorch is C++17 now, I infer that PyTorch must not support versions of MSVC earlier than MSVC 2015, so the versions of MSVC supported by PyTorch must support `[[deprecated]]`.
Therefore, since we are finished deprecating old MSVCs we can deprecate `C10_DEPRECATED`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138406
Approved by: https://github.com/cyyever, https://github.com/malfet
Since CUDA 12.4 binaries are now the default binaries on PyPI, pytorch_extra_install_requirements needs to use 12.4.
This would need to be cherry-picked to release 2.5 branch to avoid injecting these versions into metadata during pypi promotion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138458
Approved by: https://github.com/malfet
## The problem
In a typical debugger, `repr()` is used to display variables and not `str()`.
Several classes in Dynamo have a `__str__()` method that returns useful information and a `__repr__()` that does not. Having to call `str(x)` or `[str(i) for i in x]` in the debugger all the time is a chore.
`str()` should be ["informal, nicely printable"](https://docs.python.org/3/library/stdtypes.html#str) and `repr()` should ["attempt to return a string that would yield an object with the same value when passed to eval()](https://docs.python.org/3/library/functions.html#repr)".
## The solution
In the Python object model, if there is no `__str__` method, `__repr__` is used instead (but not the other way around).
So renaming `__str__` to `__repr__` in a few cases where no `__repr__` method exists now should not change observable behavior, and should make debugging easier.
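A tiny demonstration of the asymmetry in the object model:
```python
class OnlyRepr:
    def __repr__(self):
        return "OnlyRepr()"


class OnlyStr:
    def __str__(self):
        return "nicely printable"


print(str(OnlyRepr()))   # "OnlyRepr()" -- str() falls back to __repr__
print(repr(OnlyStr()))   # "<__main__.OnlyStr object at 0x...>" -- repr() ignores __str__
```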
The specific classes changed were all in `torch._dynamo.variables`:
* `builtin.BuiltinVariable`
* `constant.ConstantVariable`
* `constant.EnumVariable`
* `functions.UserMethodVariable`
* `lazy.LazyVariableTracker`
* `lazy.LazySymNodeFormatString`
* `misc.GetAttrVariable`
* `misc.NullVariable`
* `user_defined.UserDefinedObjectVariable`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136316
Approved by: https://github.com/XuehaiPan, https://github.com/jansel
Type annotations for compile_fx.
- Some of the stuff here is pretty complicated (functions which return functions that take functions) so I bailed on those and used `Any` just to get the rest landed.
- There are also changes to type signatures in other files which I did just to let mypy know more about the types in compile_fx.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138033
Approved by: https://github.com/Skylion007
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:
## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.
## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5) <-- no additional context yet
del work <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)
When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.
## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.
`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
ghstack dependencies: #137161
Co-authored-by: Will Feng <yf225@cornell.edu>
Instead of calling `safe_expand` right after symbolic expression construction, we invoke it in `ShapeEnv.simplify`. This enables more simplification with product form, e.g.,
```
(a + b)^2 / (a + b) --> (a + b)
```
which won't happen if we expand eagerly during product construction:
```
(a^2 + 2ab + b^2) / (a + b) --> no change
```
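A small sympy illustration of why delaying expansion preserves the cancellable product form:
```python
import sympy

a, b = sympy.symbols("a b")

# Product form: sympy's automatic power arithmetic cancels immediately.
print((a + b) ** 2 / (a + b))                # a + b

# Eagerly expanded numerator: the quotient stays unsimplified.
print(sympy.expand((a + b) ** 2) / (a + b))  # (a**2 + 2*a*b + b**2)/(a + b)
```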
Fixes #136044.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138235
Approved by: https://github.com/ezyang
replace_by_example is used to implement some pattern-matching passes in inductor. Previously, replace_by_example would generate nodes with very little metadata. In particular, `meta["original_aten"]` would be lost; that meant that when generating triton kernel names, you could get empty names like `triton_tem_fused_0` if the input nodes to the fused kernel were the result of a pattern-matching pass that used replace_by_example.
This also adds metadata to register_replacement patterns, including pad_mm.
This fixes the issue by copying metadata from the original node to the replacement nodes. If there are multiple original nodes we skip the metadata transfer; so if you have a `add(z, mm(x, y))`, then the metadata won't be transferred right now.
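A hedged sketch of the kind of metadata propagation described (key names beyond `original_aten` are illustrative):
```python
import torch.fx as fx


def transfer_meta(new_node: fx.Node, matched_nodes: list) -> None:
    # Only propagate when the match has a single original node; for multi-node
    # matches like add(z, mm(x, y)) we skip, as described above.
    if len(matched_nodes) != 1:
        return
    (orig,) = matched_nodes
    for key in ("original_aten", "from_node", "seq_nr"):
        if key in orig.meta:
            new_node.meta.setdefault(key, orig.meta[key])
```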
Differential Revision: [D64480755](https://our.internmc.facebook.com/intern/diff/D64480755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138089
Approved by: https://github.com/aakhundov
Summary: same as title. The plan is to pass a callable to the partitioner to perform custom autoAC via an ILP. This is the same as a previous diff, D63714905, which was landed and then subsequently reverted by PyTorch Release Engineering because of a failing unit test (f7b8d36c28). We think the unit test is buggy, and we fix it here as well.
Test Plan: tbd
Differential Revision: D64246495
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137785
Approved by: https://github.com/basilwong
Summary: The problem happened after splitting CppWrapperCpu and CppWrapperCpuArrayRef, because CppWrapperCpuArrayRef.generate_index_put_fallback missed a statement. Running test_aot_inductor.py as a whole didn't reveal the problem, but running test_index_put_with_none_index_cpu_with_stack_allocation individually did. Digging deeper, the root cause is that init_backend_registration had incorrectly cached the CPU CppWrapperCodegen class, which means CppWrapperCpuArrayRef was never picked when running test_aot_inductor.py as a whole.
Differential Revision: [D64598714](https://our.internmc.facebook.com/intern/diff/D64598714)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138303
Approved by: https://github.com/hl475
As `backend`, `pg_options`, and `group_desc` are the same for each mesh dimension, we don't need to get or create these args for `new_group` multiple times. This PR moves it from the inner loop of the subgroup creation (each subgroup ranks of each mesh dimension) to the outer loop (each mesh_dimension).
For example, given we have a 2 * 4 DeviceMesh, we are re-creating the variables `backend`, `pg_options`, and `group_desc` 2*4 = 8 times. After the change, we only create these variables once per mesh dimension, which is 2 times.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138117
Approved by: https://github.com/kwen2501
replace_by_example is used to implement some pattern-matching passes in inductor. Previously, replace_by_example would generate nodes with very little metadata. In particular, `meta["original_aten"]` would be lost; that meant that when generating triton kernel names, you could get empty names like `triton_tem_fused_0` if the input nodes to the fused kernel were the result of a pattern-matching pass that used replace_by_example.
This also adds metadata to register_replacement patterns, including pad_mm.
This fixes the issue by copying metadata from the original node to the replacement nodes. If there are multiple original nodes we skip the metadata transfer; so if you have a `add(z, mm(x, y))`, then the metadata won't be transferred right now.
Differential Revision: [D64480755](https://our.internmc.facebook.com/intern/diff/D64480755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138089
Approved by: https://github.com/aakhundov
Fixes: https://github.com/pytorch/pytorch/issues/138069
I tested this by running `python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesCpuTests.test_builtins_round_float_ndigits_pos_dynamic_shapes_cpu` before and after the change and verifying no more log spew.
I'm uncertain whether it makes sense to add a test for this PR. Question for reviewers: is there a standard paradigm for testing these log-spew fixes? Happy to add a test if someone can point me in the right direction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138356
Approved by: https://github.com/ezyang
This fixes an internal crash due to an invalid buffer size computation when the sliced API is used.
Not sure what was the purpose of
```c++
IntArrayRef baseShape;
if (src.is_view()) {
baseShape = src._base().sizes();
} else {
baseShape = getIMPSAllocator()->getBufferShape(src.storage().data());
}
int flattenedShaped = 1;
for (const auto i : c10::irange(baseShape.size())) {
flattenedShaped *= baseShape[i];
}
```
As flattenedShaped can be computed much more easily as `[srcBuf length] / src.element_size()`, and even if `srcBuf` is padded it's a safe thing to do.
When someone allocated a buffer to hold, say, uint8 and then view-casted it to float16, the attempt to compute `baseShape` returned the sizes of the original tensor in its data type, rather than the size in the new dtype.
Fixes https://github.com/pytorch/pytorch/issues/137800
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138314
Approved by: https://github.com/albanD, https://github.com/DenisVieriu97
## Summary
We are currently [updating](https://github.com/conda-forge/pytorch-cpu-feedstock/pull/277) the [`conda-forge::pytorch`](https://anaconda.org/conda-forge/pytorch) package to version 2.5.0. This update includes a new dependency, the third_party/NVTX submodule. However, like other package management frameworks (e.g., apt), conda-forge prefers using system-installed packages instead of vendor-provided third-party packages.
This pull request aims to add an option, `USE_SYSTEM_NVTX`, to select whether to use the vendored nvtx or the system-installed one, with the default being the vendored one (which is the current behavior).
## Test Plan
The `USE_SYSTEM_NVTX` option is tested by building the `conda-forge::pytorch` package with the change applied as a [patch](cd1d2464dd/recipe/patches/0005-Use-system-nvtx3.patch).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138287
Approved by: https://github.com/albanD
Fixes https://github.com/pytorch/pytorch/issues/137856.
### Issue 1
Today under `ProcessGroupNCCL::Options`, color is declared as:
```
int64_t split_color{0};
```
When passing this variable to `ncclCommSplit` which accepts `int`, the value may overflow and become negative, as in #137856. But NCCL API only accepts non-negative colors (or `NCCL_SPLIT_NOCOLOR`).
But that's not all.
### Issue 2
`split_color` is pybind'ed to python frontend. If we just change from `int64_t` to `int` in C++, pybind will complain:
```
[rank0]: TypeError: (): incompatible function arguments. The following argument types are supported:
[rank0]: 1. (self: torch._C._distributed_c10d.ProcessGroupNCCL.Options, arg0: int) -> None
```
This is because a Python `int` represents a wider range than a C++ `int`. So we cannot pass hash values -- which are potentially big ints -- from Python to C++. The PR takes the hash value modulo `c_int`'s max value.
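A rough sketch of the clamping described (the exact constant and placement in the PR may differ):
```python
C_INT_MAX = 2**31 - 1  # maximum value representable by a C `int`


def to_split_color(group_hash: int) -> int:
    # Python ints are unbounded; NCCL colors must be non-negative C ints,
    # so reduce the (possibly huge) hash into that range.
    return group_hash % C_INT_MAX
```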
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137855
Approved by: https://github.com/wconstab
The rationale behind this PR is to:
1. Move the dump of C++ traces after the FR dump, because the FR dump is timed (meaning it will not block forever), while dumping C++ traces is likely to block; so we swap the order. Ideally we also want to make the C++ stacktrace dump a future wait; if we want to go down this path, we can make that happen in another PR.
2. Add the log prefix to the logs that did not yet have it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138368
Approved by: https://github.com/c-p-i-o
Summary:
This change unblocks the CFR AOTI lowering, which was hitting a runtime error.
TL;DR:
In this model, one Triton kernel expects a scalar input dtype of i64 but is getting an i32. The reason is that `auto` can infer a smaller data type if the variable passed in is, e.g., i32, thus causing a CUDA IMA.
Original problematic kernel: `triton_poi_fused_add_ge_logical_and_logical_or_lt_46_grid_100`.
This diff manually casts all symbolic arguments to i64 at compile time for i64 Triton kernel inputs, instead of using `auto var_x = {arg}` in the cpp wrapper code.
Test Plan:
Verified in FLB locally:
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3 TORCH_LOGS="output_code" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCH_SHOW_CPP_STACKTRACES=1 CUDA_LAUNCH_BLOCKING=1 ~/fbsource/buck-out/v2/gen/fbcode/98e643f8bb44fe9d/hpc/new/models/feed/benchmark/__feed_lower_benchmark__/feed_lower_benchmark.par --skip-eager --skip-flop-estimation --lower-backend="AOT_INDUCTOR" --sync-mode=0 --precision bf16 --output-precision bf16 --lower-presets="ifr_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change" --remove-unexpected-type-cast=False --load="manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/924293663/0/gpu_lowering/input.merge"
```
Differential Revision: D64490039
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138106
Approved by: https://github.com/ColinPeppler
Fixes#138211
The `Path.rename` function has Windows-specific behavior: it raises `FileExistsError` when the target file already exists.
This does not happen on Linux, so I wrote a small reproduction to figure out what happens.
After stepping through the repro code:
```python
import os
import shutil
import sys
from pathlib import Path

_IS_WINDOWS = sys.platform == "win32"


def test_case():
    cwd = os.getcwd()
    path1 = os.path.join(cwd, "haha1.txt")
    path2 = Path(os.path.join(cwd, "haha2.txt"))
    try:
        path2.rename(path1)
    except FileExistsError as e_file_exist:
        if _IS_WINDOWS:
            # on Windows file exist is expected: https://docs.python.org/3/library/pathlib.html#pathlib.Path.rename
            shutil.copy2(path2, path1)
            os.remove(path2)
        else:
            raise e_file_exist
    except BaseException as e:
        raise e
    print("run here.")


if __name__ == "__main__":
    test_case()
```
We found that `path2.rename(path1)` can be broken down into:
1. copy file2's content to file1.
2. delete file2.
So we can implement equivalent code on the Windows path:
```python
shutil.copy2(src=tmp_path, dst=path)
os.remove(tmp_path)
```
That gives us the current PR.
TODO: this needs a cherry-pick to the release/2.5 branch, CC: @atalman.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138331
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Enable -Werror on s390x
Example of original issue on s390x:
https://github.com/pytorch/pytorch/actions/runs/11014606340/job/30585632704
Most of the warnings are not specific to s390x but to gcc-13 or gcc-14. To test this on s390x, an image with gcc-13 is needed. On s390x it is tested for new regressions on every merge via the trunk workflow.
`-Wdangling-reference` produces either obviously false warnings or suspicious warnings which, on closer inspection, look plausibly safe.
`-Wredundant-move` with new gcc complains about `std::move(...)` disabling copy elision, but removing `std::move(...)` makes the clang versions we use complain about copying objects when they could be moved. For now, also disable it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136527
Approved by: https://github.com/malfet
Add an optional `eager_init` flag to `with_comms`.
When `eager_init` is True and backend is `nccl`, we pass the `device_id` to `init_process_group()` for eager initialization.
Otherwise, `device_id` is still `None` and this goes through the normal lazy call.
Default for `eager_init` is False.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138108
Approved by: https://github.com/kwen2501
Summary: I have to keep bypassing issues because of these clang rules. Let's start with all of the bug-related rules instead of the variable-name ones, because those would introduce a lot of changed lines and can make things hard to read.
Test Plan: Format tests pass.
Differential Revision: D64411171
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138296
Approved by: https://github.com/aaronenyeshi, https://github.com/Skylion007
This diff is the starting steps of https://docs.google.com/document/u/2/d/1kAEBt4AyW7HTAhXHbjoz8FBFHNyyEA2Qo2mPn7v3WUQ/edit?usp=drive_web&ouid=113555078003219714709
It implements the following changes:
- Only log spans to scuba, so no start events are ever logged
- Log events as the full event name, without "START" or "END"
- Only log major phases from chromium events to Scuba. These are:
- entire_frame_compile (dynamo)
- backend_compile (aotdispatch)
- inductor_compile (inductor)
- codegen (inductor codegen)
Tlparse chromium events stay basically the same. But I implemented a few changes to clean that up as well:
- When there's a phase name available, log the phase name instead of the function name as the event name. This simplifies the trace so it doesn't have two identical rows. The fn_name is available as metadata on the chromium event, if interested.
- Log new events for pre and post grad passes. These do *not* log to scuba.
By making the phases much simpler in Scuba, with only categories for major phases of PT2 Compilation, we pave the way to add **much** more metadata and information to each individual event type. Diffs for that will come later.
**IMPLEMENTATION NOTES:**
- The logic for `log_chromium_event_internal` (which is the function that logs to Scuba) lives in chromium_events for now, but in the future as we add more metadata, it may belong independently in dynamo_timed or even outside of dynamo_timed. I haven't explored in detail what the refactor will look like. Once we start logging metadata for dynamo, aotdispatch, inductor, I suspect we will call log_pt2_compile_event directly, instead of making chromium event logger handle the pt2_compile_event logic. But that refactor is left for another PR on top of this one.
- There's an interesting space after pre grad passes within AOT autograd logic, that's between create_aot_dispatcher_function and pre grad passes. I'm not sure what we're spending time doing in that time, but I'll find out with a profile later.
Differential Revision: [D64479033](https://our.internmc.facebook.com/intern/diff/D64479033/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138093
Approved by: https://github.com/ezyang
This patch addresses parts of the side-effect refactor proposed in #133027;
specifically, it does 3 things:
1. Change `SideEffects.store_attr_mutations` and `PyCodegen.tempvars`
to index on `VariableTracker` rather than `MutableLocalBase`.
2. Remove the `source` field from `MutableSideEffects` and
`AttributeMutation`, and use `VariableTracker.source` instead.
3. Plumb a `overridden_sources: Dict[Source, Source]` from
`handle_aliases_for_stolen_lists` to `PyCodegen` so that we don't
update `VariableTracker.source` in place, while still preserving what
`handle_aliases_for_stolen_lists` needed (i.e., modifying codegen for
certain `VariableTracker`).
(1) and (2) are merged in 1 patch because of some dependency between
a. `OutputGraph.handle_aliases_for_stolen_lists`, which iterates over
`SideEffects.store_attr_mutations.keys()` and potentially updates
its source field to be completely different.
b. `SideEffects.codegen_update_mutated`, which happens after the above
and uses `cg(var.mutable_local.source)`.
where if we apply (1) only, (b) breaks, and if we apply (2) only, (a)
breaks.
(3) is needed for correctness, see comments in the PR for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137905
Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/mlazos
Removing just the LF canary scale config for now to test the changes in https://github.com/pytorch/test-infra/pull/5767
Those changes have been deployed to prod and appear to be working, but this will be the final proof that it is in fact reading the test-config version of scale-config and not the pytorch/pytorch copy.
Note: This will break the Scale config validation workflow on test-infra, but it's worth it since this test will be very short lived and that workflow only runs when someone modifies scale config
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138361
Approved by: https://github.com/wdvr
Add an additional check that scalars wrapped to 0-D tensors by dynamo are actually 0-D. This fixes a bug where a 1-D tensor was mistakenly converted to a scalar value rather than passed as a pointer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137303
Approved by: https://github.com/eellison
ghstack dependencies: #135701
Summary: From experiment, it seems like aten.constant_pad_nd has better QPS compared to torch.cat. The qps gain for ig ctr is ~10%, and ~5% for oc.
Test Plan:
```
buck2 run mode/opt -c fbcode.nvcc_arch=a100 //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/585279927/480/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR
```
```
buck2 run mode/opt //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/588102397/1500/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR
```
Differential Revision: D64271583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137820
Approved by: https://github.com/eellison
In this PR, we implement a lazy dictionary for export decomp behaviour, for the following reason:
1. Custom op loading can happen after import time; as a result, the decomp table might not be able to pick up the decomp. Therefore we try to delay materialization as late as possible.
I intentionally separated out core_aten_decomp to not have any custom CIA ops in this PR, to mitigate the risk of getting reverted, but in the future core_aten_decomp under torch/_decomp will exist as an alias of the official export table (torch.export.default_decompositions).
Differential Revision: [D64140807](https://our.internmc.facebook.com/intern/diff/D64140807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137650
Approved by: https://github.com/justinchuby, https://github.com/bdhirsh
To skip over the commands that do not have an output file specified.
Recently I've noticed that `generate_torch_version.py` started to run on every rebuild, and this results in a failed plan for debug info rebuilds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138290
Approved by: https://github.com/Skylion007
## `VariableTracker::build()` hides the Builders
### The problem
In the current code, creating a `VariableTracker` involves choosing one of two `Builder` classes and either calling a method, or calling a constructor that creates an object that you immediately call, [like this](083c9149b7/torch/_dynamo/variables/functions.py (L761-L768)).
Variations on this code are repeated in many places.
Moreover, the `Builder` classes have a lot of dependencies, so they have to be loaded late in the import process to avoid circular imports, and they end up being repeatedly imported at local scope.
### The solution
In this commit, the import from `builder` and the logic of choosing and calling the Builder class are hidden in a single static factory method, `VariableTracker.build()`, easier to reason about and to import.
This commit net lowers the total lines of code by over 150 lines by removing repetitive logic and unnecessary local imports.
**CHANGES:** Originally the name of the static method was `VariableTracker.create()` but a static method on a derived class, `LazyVariableTracker.create()` now exists with a different signature that's irreconcilable, so the new static method was renamed to `VariableTracker.build()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135714
Approved by: https://github.com/jansel
Previously we would error when trying to preserve the call signature for a module when it was called multiple times. This PR can now do this without erroring. The fix is to propagate call indices in a few more places.
Note that while this works in the presence of params, buffers, and tensor constants, preserving call signatures for multiple calls to a module when buffers are mutated is not supported yet. This is future work. The main problem is that we do not have enough metadata to `copy_` mutated buffers at the end of each call to a module, so the next call can read those buffers at the beginning. Making this work will likely need some explicit tracking of intermediate values of mutated buffers when collecting metadata during functionalization in export.
Note also that we stop short of creating a single graph out of multiple graphs: that is still future work. So the unflattened module will still have different targets `n`, `n@1`, `n@2`, etc. for each call when we ask the module call signature of `n` to be preserved. However it is way easier to swap all of these targets with a replacement that behaves similar to the original, because all of these calls will respect the original module call signature. (In particular, any constant inputs will be carried by the calls.)
Differential Revision: D64406945
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137999
Approved by: https://github.com/tugsbayasgalan
This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:
- Here we assume the availability of the host-side TMA API added to upstream Triton in https://github.com/triton-lang/triton/pull/4498. As of time of writing, this is not a part of the PT2 OSS Triton pin (although back-ported internally). OSS Triton pin update should be done in December 2024.
- Due to the Dynamo support implemented in the previous PR, the `tma_descriptor_metadata` dict is delivered to the `triton_kernel_wrap_` lowering and passed to `ir.UserDefinedTritonKernel` as an additional argument.
- Looking into the `tma_descriptor_metadata`, `ir.UserDefinedTritonKernel` substitutes the corresponding `TensorBox` arguments of the kernel (swapped upstream in Dynamo) by the new `ir.TMADescriptor` nodes implementing TMA descriptors in Inductor IR.
- `ir.TMADescriptor.__init__` provides the wiring between the upstream underlying `ir.TensorBox` and the downstream `ir.UserDefinedTritonKernel` kernel. In particular, we use `ir.NonOwnedLayout` wrapping `ir.ReinterpretView` to avoid the upstream tensor's buffer being deleted prematurely (before the TMA descriptor is used in the Triton kernel).
- Via `ir.TMADescriptor.codegen`, the Triton's `create_{1d,2d}_tma_descriptor` function call is codegened in the wrapper (in the host code).
- New `TMADescriptorArg` dataclass is added to handle the Triton kernel metadata pertinent to host-side TMA.
- AOT Inductor support will be implemented in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137950
Approved by: https://github.com/eellison
ghstack dependencies: #137677
Fix TODO in code
```python
# TODO: create an internal helper function and extract the duplicate code in FP16_compress and BF16_compress.
```
1. Extract common logic in `fp16_compress_hook` and `bf16_compress_hook` to `_compress_hook` method
2. Let `fp16_compress_hook` and `bf16_compress_hook` invoke `_compress_hook` with a different `dtype`, as sketched below.
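A hedged sketch of the refactor's shape (signatures mirror DDP comm hooks but are illustrative, not the exact upstream code):
```python
import torch
import torch.distributed as dist


def _compress_hook(dtype, process_group, bucket):
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = dist.get_world_size(group)
    # compress gradients to `dtype` and pre-divide for the all-reduce average
    compressed = bucket.buffer().to(dtype).div_(world_size)
    fut = dist.all_reduce(compressed, group=group, async_op=True).get_future()

    def decompress(fut):
        out = bucket.buffer()
        out.copy_(fut.value()[0])  # cast back to the original dtype
        return out

    return fut.then(decompress)


def fp16_compress_hook(process_group, bucket):
    return _compress_hook(torch.float16, process_group, bucket)


def bf16_compress_hook(process_group, bucket):
    return _compress_hook(torch.bfloat16, process_group, bucket)
```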
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138182
Approved by: https://github.com/awgu
By guarding the calls to `-[MTLCompileOptions setFastMathEnabled]` with `C10_DIAGNOSTIC_PUSH` and `POP`
and using `-[MTLCompileOptions setMathMode:]` and `-[MTLCompileOptions setMathFloatingPointFunctions:]` on MacOS15
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138238
Approved by: https://github.com/atalman
Summary:
Our watchdog does not clearly differentiate timeouts from NCCL errors, in terms of both logs and code paths.
It's important for c10d to differentiate the different reasons for watchdog failures, e.g. timeout vs. NCCL errors, and possibly let users handle the errors differently depending on the type of error.
Test Plan:
UT
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138240
Approved by: https://github.com/Skylion007
Summary: logs if an operator is run with the TorchScript runtime, using a thread_local variable set in `InterpreterState.run()`
Test Plan: buck2 run mode/dev-nosan caffe2/torch/fb/observers:scuba_observer_runner
Reviewed By: zou3519
Differential Revision: D64200781
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137986
Approved by: https://github.com/angelayi
Dynamo stance is recently added in https://github.com/pytorch/pytorch/pull/137504. When Dynamo stance is "force_eager" (user explicitly wants to fall back to eager), we would like Compiled Autograd to fall back to eager as well. This will allow the Traceable FSDP2 use case to work since "eager forward + compiled autograd backward" is not supported for Traceable FSDP2.
In general, if user wants to do "eager forward + compiled autograd backward", they should explicitly run the forward in eager instead of applying compile and then set stance to "force_eager".
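A hedged usage sketch (assuming the stance API from #137504):
```python
import torch

# With the "force_eager" stance, Dynamo falls back to eager, and after this PR
# Compiled Autograd does too, so the backward also runs eagerly.
torch._dynamo.set_stance("force_eager")


@torch.compile
def f(x):
    return (x * x).sum()


x = torch.randn(4, requires_grad=True)
f(x).backward()
```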
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138113
Approved by: https://github.com/xmfan
This fixes 4 main issues:
- The way the cuda sanitizer handles its state is weird. In particular, because the lifetime of the Mode is linked to the submodule, it might outlive the Python runtime and other loaded modules. On my current version, it even outlives the "sys" module. Given that I'm not sure of the impact of changing this lifetime handling, I'm making the exit handler a no-op when Python is already dying, at which point there is no point cleaning up.
- Adds a "disable" method to be able to test after the mode is enabled.
- Fix `Tensor.as_subclass()` to properly disable modes when creating the new Tensor object, just like we already do in `make_subclass` and `make_wrapper_subclass`. The change here just applies the exact same treatment to it.
- ~Fix `Tensor.as_subclass()` not to propagate autograd, as there is no valid backward associated here.~ We have tests that check that this behavior happens, so I guess this is not an obvious bugfix and is expected behavior. Reverted that change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138218
Approved by: https://github.com/ngimel
Previously the decomposition would upcast inputs to fp32. This led to a slowdown compared to eager, which would run in fp16. We also tried keeping the bmm in fp16 and upcasting for the epilogue, but that led to worse numerics, because the bmm in eager would do the epilogue entirely in fp32 without a downcast in the bmm accumulator.
Fix for https://github.com/pytorch/pytorch/issues/137897
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904
Approved by: https://github.com/ngimel
This is only a minor patch that I hope will change how I talk to contributors when lint fails, so that I can tell them to read the logs about lintrunner. There have been too many times when I have had to click "approve all workflows" just for lint to fail again because the developer is manually applying every fix and using CI to test. I understand there are times when lintrunner doesn't work, but I'd like most contributors to at least give it a whirl once to start.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138232
Approved by: https://github.com/kit1980, https://github.com/Skylion007
Compiled Autograd uses this AOT inference path, but it shows up as "aot_forward_graph" in tlparse output, which causes it to not be easily differentiable from normal "aot_forward_graph"s that are also in the tlparse output. This PR renames it to "aot_inference_graph" which makes it easier to tell which tlparse graph block is from Compiled Autograd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137803
Approved by: https://github.com/Microve, https://github.com/bdhirsh, https://github.com/ezyang
There's an annoying pattern emerging for pulling out the NJT min / max seqlen ints if they exist, without computing / caching them if they don't. This PR introduces private convenience functions to simplify handling this and avoid redundant checks.
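A hypothetical sketch of the convenience-helper shape (attribute and cache-key names are assumptions, not the actual NJT internals):
```python
def _maybe_min_seqlen(nt):
    # Return the cached min seqlen only if it was already computed; never
    # trigger computation/caching as a side effect of the check.
    return nt._metadata_cache.get("min_seqlen", None)


def _maybe_max_seqlen(nt):
    return nt._metadata_cache.get("max_seqlen", None)
```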
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138130
Approved by: https://github.com/soulitzer
Summary:
Support autocast re-tracing by giving it the same treatment as set_grad.
In re-tracing, when dynamo encounters an autocast HOP, we want it to trace through `with torch.autocast()` again, and replace the HOP with the traced subgraph.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_export_with_autocast
```
Differential Revision: D63856081
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138082
Approved by: https://github.com/ydwu4
Summary:
1) Add sleef back to enable SIMD on AMD.
2) Add kpack to the Triton compute_meta for AMD Triton, since there will be user-defined Triton kernels using this for k-dim packing.
Test Plan:
```
HIP_VISIBLE_DEVICES=0 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCH_LOGS="output_code,graph_code" buck run mode/{opt,amd-gpu} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --skip-flop-estimation --skip-trt --skip-ait --enable-aot-inductor --sync-mode=0 --gpu-trace --sample-input-tile-factor=1 --load="manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/input.merge" --lowering-input-str='{"serialized_inference_model_input_path":"ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/input.merge","serialized_inference_model_output_path":"ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/mi300_output.merge","submodule_names_to_lower":["merge"],"inductor_lowering_context":{"aot_inductor_lowering_settings":{"use_scripting":true,"preset_lowerer":"ifu_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change","precision":3,"output_precision":3, "remove_unexpected_type_cast":false, "sample_input_tile_factor":32}},"model_entity_id":925729118,"model_snapshot_id":0,"add_sample_inputs":false,"hardware_type":0,"platform_arch":1,"dense_in_place_format":2}' --precision=bf16 2>&1 | tee local_benchmark_log.txt
```
Differential Revision: D64262924
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137939
Approved by: https://github.com/frank-wei
Summary:
Currently, https://fburl.com/code/uka25j1i checks whether the guarded object supports weakref by looking at its `__class__`
```
if hasattr(guarded_object.__class__, "__weakref__") and not isinstance(
    guarded_object, enum.Enum
):
    obj_ref = weakref.ref(guarded_object)
```
However, we have reason to modify this slightly because we use classes that "pretend" to be some other classes (e.g. nn.Parameter). Example https://fburl.com/code/8bcktgoh :
```
class QuantizedWeights:
    # TODO: Ugly trick so torch allows us to replace parameters
    # with our custom weights. Do this properly.
    @property
    def __class__(self) -> Type[nn.parameter.Parameter]:
        return nn.Parameter

    @property
    def grad_fn(self) -> None:
        return None
```
For example, Fp8RowwiseWeights, which inherits from the base class above and also from namedtuple, actually does not have a `__weakref__` attribute, but its "class" will say it does.
I think the easiest change is to use instance-level checking rather than class-level checking:
```
if hasattr(guarded_object, "__weakref__") ...
```
But I'm wondering if this will harm any of the existing behaviors.
I'd appreciate reviews from the experts
(I just added all recommended reviewers since I'm not sure who is the best person to consult...)
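A small self-contained sketch of the mismatch described above (class names are made up for illustration; the real case uses `nn.Parameter`):
```python
import weakref
from collections import namedtuple

class PretendTarget:
    """Stands in for nn.Parameter in this sketch."""

Base = namedtuple("Base", ["w"])

class FakeParam(Base):
    __slots__ = ()  # like namedtuple itself: no __dict__, no __weakref__ slot

    @property
    def __class__(self):
        return PretendTarget  # "pretends" to be another class

obj = FakeParam(w=1.0)
print(hasattr(obj.__class__, "__weakref__"))  # True  -- checks the *pretend* class
print(hasattr(obj, "__weakref__"))            # False -- the real instance has no weakref slot
try:
    weakref.ref(obj)
except TypeError as e:
    print("weakref fails:", e)
```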
Test Plan: CI?
Reviewed By: YJYJLee
Differential Revision: D64140537
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137838
Approved by: https://github.com/williamwen42, https://github.com/jansel
`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` require bf16, so they need to be run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).
This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138178
Approved by: https://github.com/xmfan
The suggestions unusably clog up early draft PRs that are not necessarily lint clean yet. Making matters worse, even after I fix them I have to manually click through hundreds of comments to "Resolve" them. Disabling it on ghstack helps, but I occasionally do standard PRs via the fbcode export mechanism. Opt me out.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138054
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/PaliC
Summary: Float8 is becoming an increasingly popular datatype now that it is well supported on GPUs. This diff enables FP8 to work with `torch.cat`. This is pretty straightforward, since memory operations don't vary based on the input dtype, but it can be quite helpful for FP8-based models.
Test Plan:
```
buck2 run mode/opt -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.nvcc_arch=h100a -c fbcode.platform010_cuda_version=12 //caffe2/test:tensor_creation -- -r test_cat_all_dtypes_and_devices
```
Differential Revision: D64443965
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138046
Approved by: https://github.com/eqy, https://github.com/qchip, https://github.com/jianyuh
The current command for setting the `CMAKE_PREFIX_PATH` environment variable overwrites any value it already holds. Changing it to use `:` appends the conda env search path to the existing values, avoiding library-not-found issues.
`export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}:${CMAKE_PREFIX_PATH}`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134934
Approved by: https://github.com/malfet, https://github.com/EikanWang
Dynamo stance is recently added in https://github.com/pytorch/pytorch/pull/137504. When Dynamo stance is "force_eager" (user explicitly wants to fall back to eager), we would like Compiled Autograd to fall back to eager as well. This will allow the Traceable FSDP2 use case to work since "eager forward + compiled autograd backward" is not supported for Traceable FSDP2.
In general, if user wants to do "eager forward + compiled autograd backward", they should explicitly run the forward in eager instead of applying compile and then set stance to "force_eager".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138113
Approved by: https://github.com/xmfan
ghstack dependencies: #138105
Our matmul support is abysmal - let's at least get this working and do it performantly later.
Bonus: implements `bmm` as well.
jagged <-> padded dense conversions are utilized when possible, with an unbind-based fallback otherwise (the former works with torch.compile while the latter doesn't). Some testing is missing because we don't have factory function support yet :(
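A hedged usage sketch of the kind of matmul covered here (shapes are illustrative):
```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 8), torch.randn(3, 8)], layout=torch.jagged
)
w = torch.randn(8, 16)
out = nt @ w  # jagged (B, j1, 8) @ dense (8, 16) -> jagged (B, j1, 16)
print(out.size(0), out.size(2))  # 2, 16
```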
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138121
Approved by: https://github.com/cpuhrsch
Summary:
Tensor constants can show up through wrapped methods, so they may not always be found in constant attributes. They nevertheless need to be fakified, and their meta vals need to be found, to create graph signatures. Otherwise non-strict barfs.
Longer term maybe we should pull this fakification up in non-strict.
Test Plan: added test
Differential Revision: D64480272
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138091
Approved by: https://github.com/tugsbayasgalan
Fixes https://github.com/pytorch/pytorch/issues/136640
Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1.
In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately.
For instance, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard:
```
Eq((64//((2048//(s3*((s2//s3))))))), 1)
```
I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True.
I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues:
(1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions
(2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though.
Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)`
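A minimal sketch of the guard-scan idea, assuming guards expose a sympy expression (the names `statically_known_one` and `guard_exprs` are illustrative, not the actual Inductor/ShapeEnv API):
```python
import sympy

def statically_known_one(expr, guard_exprs) -> bool:
    """True if `expr` is literally 1, or an existing Eq(expr, 1) guard pins it to 1."""
    if expr.is_number:
        return int(expr) == 1
    return any(
        isinstance(g, sympy.Eq) and g.lhs == expr and g.rhs == 1
        for g in guard_exprs
    )

s2, s3 = sympy.symbols("s2 s3", positive=True, integer=True)
# a size expression that sympy can't simplify statically...
size = sympy.Integer(64) // (sympy.Integer(2048) // (s3 * (s2 // s3)))
# ...but FakeTensor tracing already guarded it to 1
guards = [sympy.Eq(size, 1)]
print(statically_known_one(size, guards))  # True
```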
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670
Approved by: https://github.com/ezyang
This PR introduces the following changes:
1. Before this PR, the subgraph's output was ([], []); in this PR, we change it to a flattened list for easier codegen and consistency with other control flow operators.
2. Before this PR, the combine_fn of scan took a sliced input but kept the sliced dimension. For example, suppose xs = torch.randn(3, 4, 5) and we scan over dim 0; the combine_fn looks like:
```
# x.shape = (1, 4, 5) instead of (4, 5)
def combine_fn(carry, x):
    ...
```
In this PR, we fix this and also simplify some of the slicing logic (see the sketch after this list).
3. This diff also makes sure we always stack ys on the first dimension.
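A small reference sketch of the new `combine_fn` contract in plain Python (the scan HOP itself is not invoked here):
```python
import torch

xs = torch.randn(3, 4, 5)    # scan over dim 0
init = torch.zeros(4, 5)

def combine_fn(carry, x):
    # after this PR, x has shape (4, 5): the scanned dim is removed, not kept as size 1
    new_carry = carry + x
    return new_carry, new_carry * 2

carry, ys = init, []
for x in xs.unbind(0):
    carry, y = combine_fn(carry, x)
    ys.append(y)
ys = torch.stack(ys, dim=0)  # ys are always stacked on the first dimension: (3, 4, 5)
```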
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135601
Approved by: https://github.com/zou3519
ghstack dependencies: #135600
Summary:
Record the world size in the log and Scuba table.
This helps us quickly figure out if there are flight recorder files missing from some ranks.
Test Plan: Ran locally and noted that size was logged to scuba
Differential Revision: D64442949
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138044
Approved by: https://github.com/Skylion007
Summary:
## Why
random.fork_rng doesn't support meta device:
```
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/aps_models/ads/tools/memory_estimator/estimation_dense.py", line 655, in estimate_dense_memory_size
[rank0]: losses.sum().backward()
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/_tensor.py", line 604, in backward
[rank0]: return handle_torch_function(
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/overrides.py", line 1718, in handle_torch_function
[rank0]: result = mode.__torch_function__(public_api, types, args, kwargs)
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/_device.py", line 106, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/_tensor.py", line 613, in backward
[rank0]: torch.autograd.backward(
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/autograd/__init__.py", line 347, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/checkpoint.py", line 1125, in unpack_hook
[rank0]: frame.recompute_fn(*args)
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/checkpoint.py", line 1507, in recompute_fn
[rank0]: with torch.random.fork_rng(
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/runtime/lib/python3.10/contextlib.py", line 135, in __enter__
[rank0]: return next(self.gen)
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/random.py", line 153, in fork_rng
[rank0]: raise RuntimeError(
[rank0]: RuntimeError: torch has no module of `meta`, you should register a module by `torch._register_device_module`.
```
This blocks us from running backward() on a model with checkpointing enabled in meta mode.
## What
This diff handles the case of meta device in random.fork_rng.
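A hedged sketch of what the traceback boils down to: checkpoint recompute calls `fork_rng` with the tensors' device type, which is `meta` here.
```python
import torch

# Before this diff: RuntimeError "torch has no module of `meta`, you should
# register a module by `torch._register_device_module`."
# After this diff: the meta device case is handled instead of raising.
with torch.random.fork_rng(devices=[], device_type="meta"):
    pass
```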
Test Plan: Tested with toy model which has checkpoint on its module: P1641201046
Differential Revision: D64161410
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137715
Approved by: https://github.com/kit1980
Summary:
Add testing for autocast and set_grad nodes for export_for_training. In export_for_training, we do not wrap the autocast and set_grad nodes into a HOP, but we should still have the set_grad_enabled/autocast nodes.
Add support for autocast in non-strict export. Previously, `_enter_autocast` and `_exit_autocast` nodes didn't show up in the export graph when we used `strict=False`.
- In autocast's enter and exit function, we dispatch to `PreDispatchTorchFunctionMode.__torch_function__`.
if we have PreDispatchTorchFunctionMode in our function_mode_stack, the call stack looks like below. This is mostly the same call stack as strict mode, except strict mode enters [here](https://www.internalfb.com/code/fbsource/[0d4f1135cacdb26c6e01d5dce1ce52a15d61ee48]/xplat/caffe2/torch/_dynamo/variables/ctx_manager.py?lines=806).
```
- torch.amp.autocast.__enter__()'s torch.overrides.handle_torch_function
- torch.fx.experimental.proxy_tensor.TorchFunctionMetadataMode.__torch_function__
- torch.amp._enter_autocast()'s torch.overrides.handle_torch_function
- PreDispatchTorchFunctionMode.__torch_function__
```
- in `PreDispatchTorchFunctionMode.__torch_function__`, we create the autocast nodes.
- to match the strict mode behavior, we let the input node to the `_exit_autocast` node be the corresponding `_enter_autocast` node. This requires us to maintain a stack in `PreDispatchTorchFunctionMode`.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_export_with_autocast
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_export_with_set_grad
```
Differential Revision: D64016023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137495
Approved by: https://github.com/bdhirsh
We used a LIFO stack to store the CUDA events in the cache. We like a FIFO deque better, so aside from improving the readability of the code, we now use a deque instead. As @wconstab pointed out, both approaches are equally correct because the moment we put the event into the stack/deque, the event is already ready for reuse; this change is mostly a preference, not an attempt to fix anything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138048
Approved by: https://github.com/kwen2501
ghstack dependencies: #138040
**MOTIVATION**
We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.
**CHANGES**
- Add support for HPU devices within the payload function.
- Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances.
- Expand the supported_activities() function to include checks for torch.profiler.ProfilerActivity.HPU.
- Apply skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133975
Approved by: https://github.com/briancoutinho, https://github.com/aaronenyeshi
Summary:
Blocking wait mode is not widely used; it is probably useful for debugging.
In blockingWait mode, we don't need to enable the watchdog thread to check for timeouts or NCCL errors, because the main thread will throw an exception if an error happens: it is obvious to the user which work failed, and it is the user's responsibility to handle the exception.
Test Plan:
CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138001
Approved by: https://github.com/fduwjj, https://github.com/c-p-i-o
ghstack dependencies: #137799
Here is why we see using `CUDAEventCache` cause crashes and data corruption:
1. The deleter is doing its job and appends the event to the stack.
2. In create, instead of getting a reference, we are getting a copy of eventsArray_[i] (which is a std::vector). This is bad because we don't really remove the element from the stack: while we thought we had popped the last one off the stack, it turns out the last one is still there, so we end up reusing the same event again and again. What's worse, since we keep adding new events to the stack, the stack will eventually explode and a crash happens.
The fix is easy: just get a reference. A local torchtitan run sees a non-NaN loss.
We also want to use a deque instead of a stack, and refactor the code a bit to make it more readable (in a separate PR).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138040
Approved by: https://github.com/kwen2501, https://github.com/shuqiangzhang
NOTE: this PR removes `ScheduleFlexibleInterleaved1F1B`; let me know if there are any concerns.
`ScheduleFlexibleInterleaved1F1B` is a superset of `Interleaved1F1B` and uses most of the same implementation, but relaxes the condition that `n_microbatches % pp_size == 0`. This PR refactors that implementation into `Interleaved1F1B` and then removes `ScheduleFlexibleInterleaved1F1B`, since it is confusing to have both schedules with similar names. It also refactors the zero bubble logic to live in the `ZeroBubble` schedule class.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137783
Approved by: https://github.com/wconstab
This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:
- Here we assume the availability of the host-side TMA API added to upstream Triton in https://github.com/triton-lang/triton/pull/4498. As of time of writing, this is not a part of the PT2 OSS Triton pin (although back-ported internally). OSS Triton pin update should be done in December 2024.
- To capture the chain of calls `t.data_ptr() --> create_{1d,2d}_tma_descriptor(ptr, ...) --> kernel[grid](tma_desc, ...)`, we add three new variable trackers: `DataPtrVariable`, `CreateTMADescriptorVariable` (for the function), `TMADescriptorVariable` (for TMA descriptor object). This is to maintain the path back from the Triton kernel to the Tensor from which the TMA descriptor has been created.
- The newly introduced variables have `reconstruct` methods used in case of graph breaks.
- The `tma_descriptor_metadata` extracted from the captured `create_{1d,2d}_tma_descriptor` calls is propagated through the HOPs in Dynamo and AOTAutograd to be used by the downstream compiler (e.g., Inductor). See the unit tests for what the captured HOP arguments look like.
- In the Dynamo-captured fx graph, we replace the TMA descriptor arguments of the Triton kernel by the underlying Tensors, to be able to track the input/output relationships in terms of Tensors.
- In the Triton kernel mutation analysis pass (in AOTAutograd), we use the `tt.experimental_descriptor_store` TTIR op to detect mutations of the underlying tensors via TMA descriptors. So that downstream AOTAutograd can perform functionalizations as required.
- JIT Inductor and AOT Inductor support will be implemented in follow-up PRs.
Differential Revision: [D64404928](https://our.internmc.facebook.com/intern/diff/D64404928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137677
Approved by: https://github.com/zou3519
Summary:
This PR is trying to let users know which exact collective call from the python thread is failing, and to let them customize their own error handling function, instead of the watchdog thread crashing everything.
This is potentially very useful in fault tolerant training, in which we can have in-process restart.
E.g., when an NCCL error is detected, users can potentially abort comms, re-init comms and go back to the previously checkpointed step and try again, instead of crashing the whole job.
This is to allow users to check the status of each collective call, using the ivalue::future libs in PT core. It also allows users to attach their customized failure handling functions by:
work.get_future_result().then(error_handling_func)
Note that the above call is also non-blocking for the CPU thread.
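A hedged usage sketch, assuming a single-rank NCCL process group (needs one GPU); the setup and error-handling body are illustrative:
```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1)
t = torch.ones(4, device="cuda")
work = dist.all_reduce(t, async_op=True)

def error_handling_func(fut):
    # fut carries the work's result/status rather than raising, so user code can
    # e.g. abort and re-init comms and roll back to the last checkpointed step
    print("collective completed with status:", fut.value())

work.get_future_result().then(error_handling_func)  # non-blocking for the CPU thread
work.wait()
dist.destroy_process_group()
```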
Test Plan:
Added a new test, test_get_future_result, to verify that the WorkResult is correctly propagated to the users.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137799
Approved by: https://github.com/fduwjj, https://github.com/wconstab
```
Parallelization strategy: after each rank copies its shard into its local
p2p buffer, every rank issues independent p2p copy -> shard_consumer
sequences to two streams. In addition to computation/communication
overlapping, the strategy allows for computation/computation overlapping,
greatly reducing quantization inefficiency.
Notation:
- "mv" for the copy to local buffer
- "cp" for p2p copies
- "b" for barriers
Constraints:
- The GPU scheduler may or may not overlap "mv" with the first shard_consumer.
- "cp" from different streams cannot overlap.
Ideal scenario 0 - "mv" overlaps with the first shard_consumer:
stream 0: [ shard_consumer ][ cp ][ shard_consumer ]
stream 1: [ mv ][b][ cp ][ shard_consumer ]
Ideal scenario 1 - "mv" is scheduled before the first shard_consumer:
stream 0: [ shard_consumer ][ cp ][ shard_consumer ]
stream 1: [ mv ][b][ cp ][ shard_consumer ]
Suboptimal scenario 0 - "mv" is scheduled after the first shard_consumer:
stream 0: [ shard_consumer ] [ cp ][ shard_consumer ]
stream 1: [ mv ][b][ cp ][ shard_consumer ]
Suboptimal scenario 1 - "b" is scheduled after the first shard_consumer:
stream 0: [ shard_consumer ] [ cp ][ shard_consumer ]
stream 1: [ mv ] [b][ cp ][ shard_consumer ]
We haven't yet figured out a way to ensure "mv" and "b" are either
overlapped with or scheduled before the first shard_consumer. Thus, to
prevent suboptimal scenarios, we are giving up the chance to overlap "mv"
and "b" with the first shard_consumer for now.
```
This PR improves the scheduling for mm kernels with high SM utilization. The GPU scheduler tends to not overlap local DtoD copies with such kernels, which leads to suboptimal scheduling. The following is an example of pipelining PyTorch's cutlass-based, row-wise scaling fp8 kernel:
Before this PR:
<img width="298" alt="image" src="https://github.com/user-attachments/assets/81e0a7f4-18ee-47c6-b258-04fdaca7a6a2">
With this PR:
<img width="253" alt="image" src="https://github.com/user-attachments/assets/982de5a8-da1e-4a8f-b67e-c9c869b0a77f">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137850
Approved by: https://github.com/weifengpy
ghstack dependencies: #137643, #137738, #137805, #137836
```
Parallelization strategy: every rank issues independent compute
-> barrier -> p2p copy sequences on two streams. In addition to
computation/communication overlapping, the strategy allows for
computation/computation overlapping, greatly reducing
quantization inefficiency.
Ideally, stream activities would look like this ("b" for
barriers, "cp" for p2p copies):
[rank 0]
stream 0: [ chunk_producer ][b][ cp ][ chunk_producer ][b][ cp ]
stream 1: [ chunk_producer ][b][ cp ][ chunk_producer ][b][ cp ]
[rank 1]
stream 0: [ chunk_producer ][b][ cp ][ chunk_producer ][b][ cp ]
stream 1: [ chunk_producer ][b][ cp ][ chunk_producer ][b][ cp ]
Note that the barriers synchronize streams with the same ID
across ranks. They don't synchronize streams on the same rank.
Since the work on both streams is independent, there's no
guarantee that the chunk_producer from stream 0 or stream 1 will
be scheduled first. If there is a scheduling mismatch across
ranks, the barrier forces all ranks to wait for the slowest.
When scheduling mismatches occur among ranks, the stream
activities might look like this (note that p2p copies from
different streams cannot overlap with each other):
[rank 0]
stream 0: [ chunk_producer ][b ][ cp ][ chunk_producer ][b ][ cp ]
stream 1: [ chunk_producer ][b] [ cp ][ chunk_producer ][b] [ cp ]
[rank 1]
stream 0: [ chunk_producer ][b] [ cp ][ chunk_producer ][b] [ cp ]
stream 1: [ chunk_producer ][b ][ cp ][ chunk_producer ][b ][ cp ]
To prevent this, we need to ensure that the chunk_producer on
stream 1 gets scheduled first on every rank. Without access to
the underlying kernels, CUDA offers no API to control the
scheduling order of two independent, overlapping kernels. Our
solution is to issue a small sleep kernel in stream 0. The sleep
duration is insignificant, but having an extra task in stream 0
will almost guarantee that the chunk_producer on stream 1 gets
scheduled first. Once the first chunk_producer is scheduled in
the correct order, there's very little room for the scheduling
order of subsequent kernels to be inconsistent across ranks.
```
Currently, we perform stream synchronization to ensure scheduling order. The stream synchronization has no bearing on correctness, but prevents inconsistent scheduling orders across ranks.
Without the stream synchronization, ranks may have inconsistent scheduling order, and the barriers cause all ranks to wait for the slowest rank:
<img width="379" alt="image" src="https://github.com/user-attachments/assets/ffb97e76-7e19-4449-b121-83c32ec3e91d">
With stream synchronization, the inconsistent scheduling order issue is addressed, but we lose compute/compute overlapping (this is the state before this PR):
<img width="378" alt="image" src="https://github.com/user-attachments/assets/4cb76246-625f-4fc1-b49a-823ae46d3f23">
With this PR, we get both consistent scheduling order across ranks and compute/compute overlap:
<img width="327" alt="image" src="https://github.com/user-attachments/assets/51ab1bdc-4f60-46e0-b53c-6d208e2d4888">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137836
Approved by: https://github.com/weifengpy
ghstack dependencies: #137643, #137738, #137805
This PR add support for `A_scale` to be row-wise scale. The op can automatically detect whether the row-wise scale is sharded or replicated. When the row-wise scale is sharded, the op would all-gather the scale in a pipelined fashion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137805
Approved by: https://github.com/weifengpy
ghstack dependencies: #137643, #137738
This will retry connection timeout failures up to the timeout duration. Under heavy load the server may not be able to immediately accept the connection. In such a case we do want to retry the connection rather than fall back to ipv4 for the remainder of the connection timeout.
The connection timeout here is not the same as the c10d timeout, which appears to be higher. We could adjust the Linux timeout directly, but using the c10d retry loop keeps things more consistent and gives us things like exponential backoff, logs, etc.
Example failure:
```
socket.cpp:752] [c10d] The client socket has failed to connect to [...]:29400 (errno: 110 - Connection timed out).
socket.cpp:752] [c10d] The IPv4 network addresses of (..., 29400) cannot be retrieved (gai error: -2 - Name or service not known).
... repeats ipv4 connection failure
```
From Linux man page: https://man7.org/linux/man-pages/man2/connect.2.html
```
ETIMEDOUT
Timeout while attempting connection. The server may be
too busy to accept new connections. Note that for IP
sockets the timeout may be very long when syncookies are
enabled on the server.
```
Test plan:
CI for backwards compatibility
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138003
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj, https://github.com/rsdcastro
### Fix 1: Throw async error during init wait
Previously we just busy-waited for `ncclSuccess`; if the nonblocking init encountered an error, we never reported it. Added detection of async errors via `ncclGetAsyncError`.
### Fix 2: Add wait after comm split
```
// After calling ncclCommSplit in non-blocking mode, we should wait for the
// source communicator to be out of ncclInProgress state.
// Reason 1:
// it's unsafe to call new operations on the parent comm while it's in
// ncclInProgress state.
// Reason 2:
// as of NCCL 2.23, the ptr value of child comm will not be filled until the
// state of parent comm is ncclSuccess. This may change in the future. See:
// https://github.com/NVIDIA/nccl/issues/1472
```
This wait does not mean the child comm is ready for use, nor does it block until that point.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137741
Approved by: https://github.com/shuqiangzhang
The Triton `AttrsDescriptor` object was refactored in https://github.com/triton-lang/triton/pull/4734. These changes add support for the new `AttrsDescriptor` while maintaining backwards compatibility with the existing version. The main changes are different names for the initialization of the descriptor parameters, and creation via a static method instead of the class constructor.
Depends on #137458 which removes some unused logic around the old descriptor. Those changes make this PR cleaner, but if for some reason that old logic is still used I can make adjustments.
Use of the new `AttrsDescriptor` depends on https://github.com/triton-lang/triton/pull/4888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137757
Approved by: https://github.com/jansel
Related to #107302
The following tests failed in test_binary_ufuncs.py when testing with NumPy 2.
```
FAILED [0.0050s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support__refs_sub_cpu_complex64 - AssertionError
FAILED [0.0043s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support__refs_sub_cpu_float32 - AssertionError
FAILED [0.0048s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support_sub_cpu_complex64 - AssertionError
FAILED [0.0043s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support_sub_cpu_float32 - AssertionError
FAILED [0.0028s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_shift_limits_cpu_uint8 - OverflowError: Python integer -100 out of bounds for uint8
```
This PR fixes them.
More details:
* `test_shift_limits` failed because `np.left_shift()` and `np.right_shift()` no longer support negative shift values in NumPy 2 (see the sketch after this list).
* `test_scalar_support` failed because NumPy 2 changed its dtype promotion rules. We special-cased the incompatible cases by changing the expected dtypes.
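A hedged sketch of the first failure's root cause, NumPy 2's stricter handling of out-of-range Python integer scalars:
```python
import numpy as np

a = np.ones(4, dtype=np.uint8)
# Under NumPy 2, the Python int -100 can no longer be coerced for a uint8 operand:
# this raises OverflowError("Python integer -100 out of bounds for uint8").
# Under NumPy 1 the call did not raise this error, which is what the test relied on.
np.left_shift(a, -100)
```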
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137937
Approved by: https://github.com/albanD
Related to #107302
`TestExport.test_exported_objects` in `test/torch_np/test_basic.py` is failing with NumPy 2.
The test is checking if all methods under `torch._numpy` exist in `numpy`.
However, some of them are removed in NumPy 2.
This PR fixes the issue by not checking the removed methods when running with NumPy 2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137814
Approved by: https://github.com/albanD
The Intel GPU ATen library (libtorch_xpu) utilizes `torchgen` to generate structured kernels. Currently, the generated structured kernels are decorated with `TORCH_API` to control visibility, while `TORCH_API` is controlled by the `CAFFE2_BUILD_MAIN_LIB` macro. However, we cannot naively enable `CAFFE2_BUILD_MAIN_LIB` for the Intel GPU ATen library, because the macro does not only serve the `TORCH_API` semantics; it also makes the semantics of `TORCH_API` be symbol `hidden`.
https://github.com/pytorch/pytorch/blob/main/c10/macros/Export.h#L95-L99
Therefore, we need to use ` TORCH_XPU_API` to decorate the produced structure kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137794
Approved by: https://github.com/atalman
ghstack dependencies: #137873
MOTIVATION
We recently verified some quantization tests on devices other than CPU (e.g., CUDA and Intel Gaudi devices identified as 'hpu'). We noticed a device mismatch error because eps is a tensor created on the CPU, while the other tensors (min_val_neg, max_val_pos, scale, zero_point) are moved to the targeted _device_.
CHANGES
Move eps to the _device_ of the other tensors.
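A minimal sketch of the change, assuming the observer-style variable names mentioned above:
```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
min_val_neg = torch.zeros(4, device=device)
scale = torch.rand(4, device=device)

# Before: eps was created on the CPU, mismatching the device of scale/min_val_neg.
# After (sketch): create eps on the same device as the other tensors.
eps = torch.tensor(torch.finfo(torch.float32).eps, device=min_val_neg.device)
scale = torch.max(scale, eps)
```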
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135204
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Previously the decomposition would upcast inputs to fp32. This led to a slowdown compared to eager, which would run in fp16. We also tried keeping the bmm in fp16 with the upcasting only for the epilogue, but that led to worse numerics, because the bmm in eager would do the epilogue all in fp32 without a downcast in the bmm accumulator.
Fix for https://github.com/pytorch/pytorch/issues/137897
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904
Approved by: https://github.com/ngimel
Fixes an issue where if the context arg is not provided, Dynamo would throw an arg mismatch error.
The skips are there because Dynamo would previously fall back to eager on those tests due to the arg mismatch error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137809
Approved by: https://github.com/drisspg
Summary: Tests "Fix Clear On Fork" by forking a process after a profile has already been done. Afterwards, we check that all the PIDs/TIDs are as expected.
Test Plan: Ran buck2 test 'fbcode//mode/dev' fbcode//caffe2/test:profiler -- --exact 'caffe2/test:profiler - test_forked_process (profiler.test_profiler.TestProfiler)'
Differential Revision: D63992036
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137511
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
The `AttrsDescriptor` class has been present in Triton for almost a year now (introduced [here](72c9833927)), so we should be able to rely on it existing. I am in the process of supporting the new `AttrsDescriptor` class and @jansel suggested I split changes to the existing class out separately to make sure nothing breaks removing the legacy attribute descriptor attributes.
Initially I attempted to remove the branching around detecting whether `AttrsDescriptor` exists but that breaks because PyTorch must build without Triton. So, I went back and updated for the naming introduced in the commit linked above, and also removed two unused attributes `divisible_by_8` and `ids_to_fold` which were removed in Feb 2024 (https://github.com/triton-lang/triton/pull/3122 and https://github.com/triton-lang/triton/pull/3080 respectively).
With these changes only the internal workings of the `AttrsDescriptor` class will differ between supported Triton versions, but the data stored will remain consistent.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137458
Approved by: https://github.com/jansel
This reverts commit e688b78791d01bd91614a61e57726c32beb46ee4.
We're reverting this because:
1) The original PR (#134872) fixed a bug but caused another one. The
assessment is that the bug it caused is worse than the bug it fixed.
2) it was reverted on the release 2.5 branch, so we want to prevent
divergence
3) The original author is out-of-office for a while so we don't want the
divergence to wait until they're back
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137891
Approved by: https://github.com/Skylion007
For `autograd.Function`, the engine will try to allocate correctly-shaped zeros for `None` grads (i.e. in the case where the output isn't used downstream). It determines the shape of these zeros from the `VariableInfo` entry, which is derived from the forward output shape. For the NJT forward output case, the size info stored will contain a nested int, and calling `zeros()` with this size throws:
```
RuntimeError: .../build/aten/src/ATen/RegisterCPU.cpp:5260: SymIntArrayRef expected to contain only concrete integers
```
This PR fixes this by storing the full tensor in the `VariableInfo` for the nested case and calling `zeros_like()` to allocate correctly-shaped zeros. This is pretty inefficient; ideally we would want to save just the NJT shape and be able to construct zeros from it, but this requires factory function support for nested ints (WIP). So this is a short-term fix until we have that.
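A hedged repro sketch of the pattern this fixes: a custom `autograd.Function` with an NJT output that goes unused downstream, so the engine must materialize zeros for its grad:
```python
import torch

class MyFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone(), x.clone()  # the second output will go unused

    @staticmethod
    def backward(ctx, g1, g2):
        # g2 is materialized by the engine as zeros shaped like the unused output
        return g1

nt = torch.nested.nested_tensor(
    [torch.randn(2, 4), torch.randn(3, 4)],
    layout=torch.jagged,
    requires_grad=True,
)
used, _unused = MyFn.apply(nt)
# Previously this hit "SymIntArrayRef expected to contain only concrete integers"
# when allocating zeros from the nested-int-containing size; now zeros_like is used.
used.sum().backward()
```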
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136875
Approved by: https://github.com/soulitzer, https://github.com/huydhn
A proposal addressing Issue #1489: **Optimizer should track parameter names and not id.**
(also mentioned here: [[RFC] Introducing FQNs/clarity eyeglasses to optim state_dict](https://dev-discuss.pytorch.org/t/rfc-introducing-fqns-clarity-to-optim-state-dict/1552))
## Summary
This PR introduces a backward-compatible enhancement where optimizers track parameter names instead of just their id.
Optimizers can be initialized with `named_parameters()` as:
```python
optimizer = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9)
```
This allows for greater clarity and ease when handling optimizers, as the parameters' names are preserved within the optimizer’s `state_dict` as:
```
state_dict = {
    'state': {
        0: {'momentum_buffer': tensor(...), ...},
        1: {'momentum_buffer': tensor(...), ...},
    },
    'param_groups': [
        {
            'lr': 0.01,
            'weight_decay': 0,
            ...
            'params': [0, 1],
            'param_names': ['layer.weight', 'layer.bias']  # optional
        }
    ]
}
```
Loading `state_dict` is not changed (backward-compatible) and the `param_names` key will be ignored.
## Key Features
#### Named Parameters in Optimizer Initialization:
Optimizers can accept the output of `model.named_parameters()` during initialization, allowing them to store parameter names directly.
#### Parameter Names in `state_dict`:
The parameter names are saved as a list in the optimizer’s `state_dict` with key `param_names`, alongside the `params` indices, ensuring seamless tracking of both names and parameters.
## Backward Compatibility
#### No Breaking Changes:
This change is fully backward-compatible. The added `param_names` key in the optimizer's `state_dict` is ignored when loading a state to the optimizer.
#### Customization with Hooks:
For more control, the loaded state_dict can be modified using a custom `register_load_state_dict_pre_hook`, providing flexibility for different design needs.
## Documentation Updates
Please refer to the documentation changes for more details on how this feature is implemented and how it can be used effectively.
## Solution Example:
A suggested solution to the problem mentioned in #1489, for the same parameters but in a different order.
The following `register_load_state_dict_pre_hook` should be added to the optimizer before loading to enable loading the state dict:
```python
def adapt_state_dict_ids(optimizer, state_dict):
    # assuming a single param group.
    current_state_group = optimizer.state_dict()['param_groups'][0]
    loaded_state_group = state_dict['param_groups'][0]

    # same number of params, same names, only different ordering
    current_state_name_to_id_mapping = {}  # mapping -- param_name: id
    for i, name in enumerate(current_state_group['param_names']):
        current_state_name_to_id_mapping[name] = current_state_group['params'][i]

    # changing the ids of the loaded state dict to match the order of the given state dict.
    for i, name in enumerate(current_state_group['param_names']):
        loaded_state_group['params'][i] = current_state_name_to_id_mapping[name]

    return state_dict
```
In this code, the loaded `state_dict` ids are adapted to match the order of the current optimizer `state_dict`.
Both the previous and the current optimizers are required to be initialized with `named_parameters()` to have the 'param_names' key in the dict.
### Note
This is my first contribution to PyTorch, and I wish to receive feedback or suggestions for improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134107
Approved by: https://github.com/janeyx99
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
The test is failing (flakily?) on periodic Windows CUDA jobs with the following error:
```
__________ TestLinalgCUDA.test_matmul_offline_tunableop_cuda_float16 __________
Traceback (most recent call last):
File "C:\actions-runner\_work\pytorch\pytorch\test\test_linalg.py", line 4618, in test_matmul_offline_tunableop
os.remove(filename)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'tunableop_untuned0.csv'
```
For example, https://github.com/pytorch/pytorch/actions/runs/11292745299/job/31410578167#step:15:15097
The test tried to catch and ignore this, but this is Windows. So, the fix is to:
1. Ignore if these files couldn't be removed
2. Write them to a temp directory instead, otherwise, [assert_git_not_dirty](https://github.com/pytorch/pytorch/blob/main/.ci/pytorch/test.sh#L286) won't be happy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137835
Approved by: https://github.com/atalman
This PR adds a Selective Activation Checkpointing (SAC) Estimator, built on top of the `Runtime Estimator`, for estimating memory and recomputation time trade-offs.
It provides a `TorchDispatchMode`-based context manager that estimates the memory and runtime trade-offs of functions or `torch.nn.Module`s for SAC, using the `Runtime Estimator` #134243 under the hood to support two estimation modes: 'operator-level-benchmark' and 'operator-level-cost-model' (roofline model). The SAC Estimator provides detailed statistics and metadata for the operators of each module, including a greedy order for selecting operators to be recomputed/checkpointed and per-module trade-off graphs. It is designed to be used under FakeTensorMode and currently supports estimation of compute time and memory usage.
It's inspired from: [XFormers SAC](https://github.com/facebookresearch/xformers/blob/main/xformers/checkpoint.py) by @fmassa
End-to-end example:
```
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.distributed._tools.sac_estimator import SACEstimator
from torch.testing._internal.distributed._tensor.common_dtensor import (
    ModelArgs,
    Transformer,
)

if __name__ == "__main__":
    dev = torch.cuda.current_device()
    vocab_size = 8192
    bsz, seq_len = 8, 1024
    model_args = ModelArgs(
        n_layers=4,
        n_heads=12,
        vocab_size=vocab_size,
        max_seq_len=seq_len,
        dim=768,
        dropout_p=0.1,
    )
    with FakeTensorMode():
        with torch.device(dev):
            model = Transformer(model_args)
        inp = torch.randint(
            0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev
        )
        sace = SACEstimator()
        with sace(estimate_mode_type='operator-level-cost-model'):
            loss = model(inp).sum()
            loss.backward()
    sace.pwlf_sac_tradeoff_curve(n_segments=2, save_tradeoff_graphs=True)
    sace.display_modulewise_sac_stats(depth=4, print_tabular=True)
```
Example AC Stats for one of the transformer layers:

Example AC Trade-off for one of the transformer layers:

Example AC Trade-Off graph one of the transformer layers:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135208
Approved by: https://github.com/weifengpy
With LTO (Link Time Optimization) enabled in CFLAGS, some compilers will optimize away and strip the unwind_c function because they cannot resolve the reference correctly, breaking the build with an undefined reference in unwind_entry. Add an attribute to avoid this.
Fixes #121282
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137862
Approved by: https://github.com/Skylion007
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Builds upon #76951.
Benchmarking code is the same as in #76950.
AMD Ryzen Threadripper PRO 3995WX:
```
batch_size drop_last origin new speedup
------------ ----------- -------- ------ ---------
4 True 0.94 0.5706 64.74%
4 False 0.9745 0.9468 2.93%
8 True 0.7423 0.3715 99.82%
8 False 0.7974 0.5666 40.73%
64 True 0.5394 0.2085 158.76%
64 False 0.6083 0.2697 125.51%
640 True 0.5448 0.1985 174.41%
640 False 0.7085 0.2308 206.91%
6400 True 0.5554 0.2028 173.88%
6400 False 0.7711 0.2109 265.60%
64000 True 0.556 0.2091 165.82%
64000 False 0.7803 0.2078 275.58%
```
When `drop_last == True`, it uses `zip` to speed things up.
When `drop_last == False`, it uses `itertools` to speed things up.
`itertools` was the fastest way I could find that deals with the last batch if it is smaller than `batch_size`. I have a pure python method too, but it is slower when `batch_size` is 4 or 8, so I have committed the `itertools` version for now.
Happy to chat further about this change :-) I understand you may not want to introduce the `itertools` package into [sampler.py](https://github.com/pytorch/pytorch/blob/main/torch/utils/data/sampler.py).
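A hedged sketch of the two fast paths described above (not the exact sampler code):
```python
from itertools import islice

def batches_drop_last(sampler, batch_size):
    args = [iter(sampler)] * batch_size
    for batch in zip(*args):  # zip stops at the shortest iterator, dropping the short tail
        yield list(batch)

def batches_keep_last(sampler, batch_size):
    it = iter(sampler)
    while batch := list(islice(it, batch_size)):
        yield batch

print(list(batches_drop_last(range(7), 3)))  # [[0, 1, 2], [3, 4, 5]]
print(list(batches_keep_last(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```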
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137423
Approved by: https://github.com/Skylion007
PyStructSequence is the C API equivalent for `collections.namedtuple` in Python. But they have different constructors:
```python
tuple = NamedTupleType(*args)
tuple = NamedTupleType._make(args)
tuple = StructSequenceType(args)
```
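A concrete illustration using stdlib types (these are not the PyTorch bindings, just the general constructor difference):
```python
import os
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
p1 = Point(1, 2)           # NamedTupleType(*args)
p2 = Point._make((1, 2))   # NamedTupleType._make(args)

# os.stat_result is a PyStructSequence: it takes a single sequence argument.
st = os.stat_result(tuple(range(10)))
print(p1 == p2, st.st_size)
```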
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137776
Approved by: https://github.com/jansel
We use nn_module_stack in unflatten to recognize when module calls begin and end. However the current format is not sufficient to detect module call boundaries when we have successive calls to the same module, because the successive instructions (end of one call, begin of next call) have the same nn_module_stack. This causes us to effectively "unroll" successive calls to a single call. This can cause problems when preserving module call signatures because the outputs of the successive calls might be concatenated in the single call.
Previously we introduced the concept of a "call index" to generate multiple graphs when unflattening, one per call. This PR pushes this concept into nn_module_stack itself. In particular, the keys of nn_module_stack now go from `key` to `key@call_index`. (In a previous attempt, https://github.com/pytorch/pytorch/pull/137457, instead values in nn_module_stack go from (fqn, type) to (fqn, type, call_index), which is BC-breaking.)
Note that we still do not have the ability to preserve module call signatures for multiple calls to the same module. But now instead of randomly crashing we give a proper error. OTOH when not preserving module call signatures we simply generate multiple calls, each with its own graph, possibly deduplicated, matching what we would do for non-successive calls.
Test Plan: Like D64014936
Differential Revision: D64136277
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137646
Approved by: https://github.com/angelayi
This PR adds new meta functions for `lerp`, `addcmul`, and `addcdiv` (including their
respective inplace versions).
These functions only had refs implementations, which was the root cause of a significant overhead ([issue][1]) when running the `AdamW` optimizer step on the PyTorch/XLA backend. Running the meta functions resulted in the following improvements:
backend. Running the meta functions resulted in the following improvements:
- `lerp` calls: 1,550ms to 140ms (10x)
- `addcdiv` calls: 640ms to 350ms (1.8x)
- `addcmul` calls: 620ms to 300ms (2.05x)
[1]: https://github.com/pytorch/xla/issues/7923
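A hedged sketch of why a meta kernel helps here: under the meta device only output metadata is computed, so a backend can avoid tracing the slower reference decomposition.
```python
import torch

with torch.device("meta"):
    a = torch.randn(1024, 1024)
    b = torch.randn(1024, 1024)
    c = torch.randn(1024, 1024)
    out = torch.addcmul(a, b, c, value=0.5)  # shape/dtype inference only, no data

print(out.shape, out.dtype, out.device)  # torch.Size([1024, 1024]) torch.float32 meta
```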
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136909
Approved by: https://github.com/jansel
Summary:
# Latest Update
This diff is no longer needed because we did need the check to exist, to make meta behave the same as other devices, see D54526190.
---------------------------------
# Background
T176105639
| case | embedding bag weight | per_sample_weight | fbgemm lookup | forward in meta |
| --- | --- | --- | --- | --- |
| A | fp32 | fp32 | good | good |
| B | fp16 | fp32 | good | failed [check](https://fburl.com/code/k3n3h031) that forces weight dtype == per_sample_weights dtype |
| C | fp16 | fp16 | P1046999270, RuntimeError: "expected scalar type Float but found Half from fbgemm call" | good |
| D | fp32 | fp16 | N/A | N/A |
Currently we are in case A. Users need to add `use_fp32_embedding` in training to force embedding bag dtype to be fp32. However, users actually hope for case B to use fp16 as the embedding bag weight. When deleting `use_fp32_embedding`, they would fail the [check](https://fburl.com/code/k3n3h031) that forces `weight dtype == per_sample_weights dtype ` in meta_registration.
The check is actually not necessary, because the fbgemm backend does support case B. Additionally, later on in `meta_embedding_bag`, `weight` and `per_sample_weights` don't need to be in the same dtype (https://fburl.com/code/q0tho05h, weight is src, per_sample_weights is scale) for `is_fast_path_index_select`.
# This diff
Therefore, this diff remove the unnecessary [check](https://fburl.com/code/k3n3h031) to support case B in meta forward. With such, users are able to use fp16 to be the emb bag dtype without the need to force per_sample_weights the same dtype in meta forward (see Test Plan).
# Reference diffs to resolve this issue
Diff 1: D52591217
This passes embedding bag dtype to feature_processor to make per_sample_weights same dtype as emb bag weight. However, `is_meta` also needs to be passed because of case C. fbgemm still does not support per_sample_weights = fp16 (see the above table). Therefore users are forced to only make per_sample_weights fp16 when it is on meta. The solution requires too many hacks.
Diff 2: D53232739
Basically doing the same thing as diff 1 D52591217, except that the hack is added in the TorchRec library. This adds an if in EBC and PEA: when the emb bag weight is fp16, it forces per_sample_weights to fp16 too. However, it would then run into the fbgemm issue too, and it has broken a bunch of prod models.
Test Plan:
# APS
The following command will run icvr_launcher which triggers ads_launcher and run forward in meta device:
```
buck2 run mode/opt -c python.package_style=inplace //aps_models/ads/icvr:icvr_launcher_publish -- mode=mast_ig_fm_when_combo0_uhm_publish launcher.fbl_entitlement=ads_global_tc_ads_score launcher.data_project=oncall_ads_model_platform launcher.tags=[ads_ranking_taxonomy_exlarge_fm_prod] stages.train=false
```
Result:
{F1461463993}
Reviewed By: ezyang
Differential Revision: D54175438
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136774
Approved by: https://github.com/ezyang
This test is currently failing in trunk when the memory leak check is enabled, for example https://github.com/pytorch/pytorch/actions/runs/11296206361/job/31422348823#step:22:1970. When testing locally, calling `backward` on a masked tensor always causes a memory leak until I clean up the data and the mask manually. This is probably related to this warning from masked tensor: `UserWarning: It is not recommended to create a MaskedTensor with a tensor that requires_grad. To avoid this, you can use data.clone().detach()`, but I don't know enough about the internal details here to go further. So, let's just fix the test first.
### Testing
```
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 python test/test_maskedtensor.py TestBasicsCUDA.test_stack_cuda
```
passes and doesn't warn about memory leak anymore.
The test itself came from https://github.com/pytorch/pytorch/pull/125262#issuecomment-2344068012
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137815
Approved by: https://github.com/kit1980
Related to #107302
The breakages are caused by backward incompatibility between NumPy 1 and NumPy 2.
This PR fixes all the corresponding test failures in `test_torch.py`.
1. The dtype of the return value of `np.percentile` when passed a `torch.float32` tensor.
   NumPy 1: Return value of `np.float64`.
   NumPy 2: Return value of `np.float32`.
   Solution: Enforce it with `.astype(np.float64)`.
2. The type of the return value of `np.gradient()` when returning multiple arrays.
   NumPy 1: A list of arrays.
   NumPy 2: A tuple of arrays.
   Solution: Cast the tuple to a list (see the sketch below).
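A short sketch of the two NumPy-2-compatible patterns (the array contents are illustrative):
```python
import numpy as np

x = np.arange(12, dtype=np.float32).reshape(3, 4)

# 1) np.percentile keeps float32 under NumPy 2 (NumPy 1 returned float64);
#    enforce float64 where the comparison expects it.
p = np.percentile(x, 50).astype(np.float64)

# 2) np.gradient returns a tuple of arrays under NumPy 2 (a list under NumPy 1);
#    normalize to a list before comparing.
g = list(np.gradient(x))
print(p.dtype, type(g), len(g))
```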
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137740
Approved by: https://github.com/ezyang
Updates all references to the runner determinator workflow (`_runner-determinator.yml`) from the currently cloned version to the main version.
This enables the team to push updates to this workflow, like fixing bugs or pushing improvements, and have them immediately reflected on all open PRs, avoiding potentially breaking situations and enabling fast, simple recovery in case of bugs.
From:
```
jobs:
get-label-type:
uses: ./.github/workflows/_runner-determinator.yml
```
To:
```
jobs:
get-label-type:
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137791
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/zxiiro
Performs shape inference at runtime using user-provided real tensors.
- avoids the need for users to precompute shapes which is difficult and error prone
- lets us remove args from the PipelineStage ctor (in a later PR)
- deprecates the existing inference helper in the PipelineStage constructor, for several reasons: it's problematic to have to reason about the stage submod being on the right device for shape inference
The current state as of this PR:
- Users should not pass any input or output shapes into PipelineStage ctor, and shape inference will run automatically
- To override shape inference, they can continue to pass input/output args as previously
Currently, does not add a barrier after shape-inference, which essentially pipelines shape inference with the subsequent schedule action for that stage. If this complicates debugging, we could add in a barrier (it comes at a cost, but only during the first step).
Testing:
- Removed input args from all PP test cases, thus exposing them all to shape-inference.
- Verified visually (nvidia-smi) that torchtitan PP 3D test runs shape inference fine without creating extra cuda contexts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136912
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
The original PR #122396 used the CPU device since at that point in time
there was no actual Triton CPU backend. After #133408, this is no longer
the case, so we now have multiple backends getting registered for the
CPU. The test still works in OSS but fails internally due to different
test runners initializing the backends in a different order.
This PR doesn't actually end up fixing the test internally because
cpp_extension -- needed to implement the privateuseone device -- isn't
supported there, so we simply skip it instead. However, it still makes the
OSS test independent of initialization order, which is good.
Differential Revision: [D63838169](https://our.internmc.facebook.com/intern/diff/D63838169/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137611
Approved by: https://github.com/henrylhtsang
Previously, instances of `SchedulerNode` and `FusedSchedulerNode` would explicitly check whether the compilation target is Triton when codegen'ing debug strings. Generating debug triton code is instead implemented as a callback set on scheduler nodes by `TritonScheduling`. This makes the codegen more device-agnostic and allows schedulers to customise the codegen output as opposed to it being closely coupled to the debug string codegen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135015
Approved by: https://github.com/jansel
Follow up to https://github.com/pytorch/pytorch/pull/131936. In the original bisector you'd have to test inline whether we were disabling a component - `if BisectionManager.disable_subsystem("inductor", "post_grad_passes", debug_info)`. This adds a convenient way of testing config changes for root-causing issues. I've added `emulate_precision_casts` and aot_eager_decomp_partition cse as initial ones.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137346
Approved by: https://github.com/zou3519
This function computes a topological sort using a non-recursive implementation of DFS. Upon first reading, I thought it was using Kahn’s algorithm because it uses a variable called `queue`, but upon closer reading, I noticed this variable is actually used as a stack.
This pull request improves readability by renaming the stack and changing it from `std::vector` to `std::stack`.
Note: this also changes the backing store from an `std::vector` to an `std::deque`.
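For reference, here is a hedged Python sketch of the same idea: an iterative DFS topological sort driven by an explicit stack (a queue here would instead give Kahn's algorithm):
```python
def topo_sort(graph):
    """graph: dict node -> list of successors; returns nodes in topological order (DAG assumed)."""
    order, state = [], {}  # state: 1 = in progress, 2 = done
    for root in graph:
        if root in state:
            continue
        stack = [(root, iter(graph[root]))]
        state[root] = 1
        while stack:
            node, it = stack[-1]
            for child in it:
                if child not in state:
                    state[child] = 1
                    stack.append((child, iter(graph.get(child, ()))))
                    break
            else:  # all children visited: finish this node
                state[node] = 2
                order.append(node)
                stack.pop()
    return order[::-1]

print(topo_sort({"a": ["b", "c"], "b": ["c"], "c": []}))  # ['a', 'b', 'c']
```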
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130526
Approved by: https://github.com/alanwaketan, https://github.com/malfet
adds a `default` tag to experiment configurations, allowing some experiments to be excluded by default from the random draw:
```
experiments:
lf:
rollout_perc: 25
otherExp:
rollout_perc: 25
default: false
---
```
and includes the configuration to filter which experiments are of interest for a particular workflow (comma separated):
```
get-test-label-type:
  name: get-test-label-type
  uses: ./.github/workflows/_runner-determinator.yml
  with:
    ...
    check_experiments: "awsa100"
```
The end goal is to enable us to run multiple experiments that are independent of one another. For example, while we still run the LF infra experiment, we want to migrate other runners using the current solution. An immediate use case is the A100 instances, which we want to migrate to AWS.
During the migration period, those new instances will be labeled both `awsa100.linux.gcp.a100` and `linux.aws.a100`. Once the experiment ends, we will remove the first, confusing one.
```
jobs:
  get-build-label-type:
    name: get-build-label-type
    uses: ./.github/workflows/_runner-determinator.yml
    with:
      ...
  get-test-label-type:
    name: get-test-label-type
    uses: ./.github/workflows/_runner-determinator.yml
    with:
      ...
      check_experiments: "awsa100"
  linux-focal-cuda12_1-py3_10-gcc9-inductor-build:
    name: cuda12.1-py3.10-gcc9-sm80
    uses: ./.github/workflows/_linux-build.yml
    needs:
      - get-build-label-type
      - get-test-label-type
    with:
      runner_prefix: "${{ needs.get-build-label-type.outputs.label-type }}"
      ...
      test-matrix: |
        { include: [
          { config: "inductor_huggingface_perf_compare", shard: 1, num_shards: 1, runner: "${{ needs.get-test-label-type.outputs.label-type }}linux.gcp.a100" },
          ...
        ]}
      ...
```
```
experiments:
  lf:
    rollout_perc: 50
  awsa100:
    rollout_perc: 50
    default: false
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137614
Approved by: https://github.com/malfet
Earlier the subgraphs were getting inlined into the output code. This PR lifts the subgraphs into a function, and then we just call the function in the output code.
This is the output code for test `test_cond_reintepret_view_inputs_outputs`
Before this PR - https://www.internalfb.com/intern/paste/P1632948905/
With this PR - https://www.internalfb.com/intern/paste/P1632946348/
A relevant snippet from the above paste is
~~~
def false_graph_0(args):
    false_graph_0_arg0_1, false_graph_0_arg1_1, s0 = args
    args.clear()
    s0 = s0
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        false_graph_0_buf0 = empty_strided_cuda(((-1) + s0, 20), (20, 1), torch.float32)
        false_graph_0_buf1 = empty_strided_cuda(((-1) + s0, 20), (20, 1), torch.float32)
        # Unsorted Source Nodes: [cond, z1, z2], Original ATen: [aten.sub, aten.add]
        triton_poi_fused_add_sub_1_xnumel = (-20) + (20*s0)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_sub_1.run(false_graph_0_arg0_1, false_graph_0_arg1_1, false_graph_0_buf0, false_graph_0_buf1, triton_poi_fused_add_sub_1_xnumel, grid=grid(triton_poi_fused_add_sub_1_xnumel), stream=stream0)
        del false_graph_0_arg0_1
        del false_graph_0_arg1_1
    return (reinterpret_tensor(false_graph_0_buf0, ((-3) + s0, 20), (20, 1), 40), reinterpret_tensor(false_graph_0_buf1, ((-1) + s0, 16), (20, 1), 4), )

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1 = args
    args.clear()
    s0 = arg0_1
    assert_size_stride(arg1_1, (s0, 20), (20, 1))
    assert_size_stride(arg2_1, (s0, 20), (20, 1))
    assert_size_stride(arg3_1, (), ())
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = [None] * 2
        buf0 = [None] * 2
        if arg3_1.item():
            # subgraph: true_graph_0
            true_graph_0_arg0_1 = reinterpret_tensor(arg1_1, ((-1) + s0, 20), (20, 1), 0)
            true_graph_0_arg1_1 = reinterpret_tensor(arg2_1, ((-1) + s0, 20), (20, 1), 0)
            (true_graph_0_buf0, true_graph_0_buf1) = true_graph_0([true_graph_0_arg0_1, true_graph_0_arg1_1, s0])
            buf0[0] = true_graph_0_buf0
            buf0[1] = true_graph_0_buf1
        else:
            # subgraph: false_graph_0
            false_graph_0_arg0_1 = reinterpret_tensor(arg1_1, ((-1) + s0, 20), (20, 1), 0)
            false_graph_0_arg1_1 = reinterpret_tensor(arg2_1, ((-1) + s0, 20), (20, 1), 0)
            (false_graph_0_buf0, false_graph_0_buf1) = false_graph_0([false_graph_0_arg0_1, false_graph_0_arg1_1, s0])
            buf0[0] = false_graph_0_buf0
            buf0[1] = false_graph_0_buf1
        del arg1_1
        del arg2_1
        del arg3_1
        buf1 = buf0[0]
        buf2 = buf0[1]
        del buf0
    return (buf1, buf2, )
~~~
The key change is to recursively call `codegen` for the subgraph and rely on `SubgraphPythonWrapper` to generate just the subgraph `fn`. The resulting subgraph_code is then inserted into the parent wrapper.
Note that this PR only works for the python wrapper. For the cpp wrapper, we need a lot of refactoring to ensure that we don't duplicate the global variables in the output_code. So, for now, I fall back to the old way of inlining for the cpp wrapper. I am hoping someone with more familiarity with the cpp wrapper can support subgraph lifting (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov).
This work will unblock hierarchical compilation (or cold start compile time work).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137200
Approved by: https://github.com/desertfire, https://github.com/eellison
This PR reduces the overhead on the CPU side by eliminating the use of c10::str in creating signatures. Instead, we use the fmt library. TunableOp overhead on the CPU is reduced by around ~40%. The improvement is most noticeable on small GEMMs. This PR does not contain any bug fixes or new features.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135371
Approved by: https://github.com/jeffdaily
# Summary
I started to explore the performance of _scaled_mm against a triton-based persistent TMA kernel for RowWise scaling.
There are more details here: https://github.com/drisspg/transformer_nuggets/pull/36
It clearly showed that there was some room for improvement on larger problem sizes compared to Triton's performance. Note that the Triton kernel only has a 128x128x128 tile shape, whereas scaled_mm has a 64x128x128 tile shape that we use for smaller problem sizes, which may explain some of the perf delta at smaller shapes.
This led to seeing if we can improve our triton codegen lowering for _scaled_mm (I think we should still do this: https://github.com/pytorch/pytorch/pull/137517).
In the meantime @Chillee suggested I make sure swizzling is set for the large matmul shapes
This PR makes sure that we increase the max_swizzle_size for the large matmuls.
## Performance
Note: red means the Triton-based TMA beats _scaled_mm; blue means _scaled_mm is faster
On nightly w/ Triton at (2ef33c6c4c3)

You can see that as M,K,N increase there is a clear win for the Triton Persistent TMA.
After this PR:

For example w/ this change(power limited gpu)
M=16384 K=16384 N=16384
TFlops Before :`985.49`
TFlops After: `1304.69`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137681
Approved by: https://github.com/eqy
Fixes #136722, fixes #136718
By default, it goes to onednn. So this PR adds a check to ensure stride > 0. Now the program will exit with an error message if stride is 0.
FBGEMM and QNNPACK can create modules with stride=0 without error, but the program crashes when calling forward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136739
Approved by: https://github.com/jgong5
Thanks @eqy for reminding me of this RFC: https://github.com/pytorch/pytorch/issues/119797
This PR is meant to:
- provide a way to abort multiple PGs without deadlocking each other.
- provide a possibility to manually handle comm errors or timeouts (and potentially recovery of such).
One can find an example from: https://github.com/NVIDIA/nccl/issues/1013
## How is it different from `destroy_process_group`?
`destroy_process_group` is meant for normal exit, while `_abort_process_group` is meant for bailout upon hangs or failures. Similar to `ncclCommDestroy` vs `ncclCommAbort`.
## What's new in `_abort_process_group`?
It added support for "group abort" semantic. The "group abort" semantic is capable of aborting multiple NCCL comms concurrently, avoiding deadlock in otherwise serialized `ncclCommAbort` executions. Details are in the [RFC](https://github.com/pytorch/pytorch/issues/119797) targeting [the hang issue in multi-comm case](https://github.com/NVIDIA/nccl/issues/1013). `Group abort` semantic is added in NCCL 2.22.
## What's next?
Ideally, the watchdog's behavior should support "group abort" too. But this is hard to implement today due to the lack of a "global view" by each PG's individual watchdog. A semi-big refactor may be needed to "uplift" the watchdogs to a global level or consolidate them into one (i.e. one dog watching multiple PGs).
In any case, it may not be a bad idea to experiment with the "group abort" feature via a manual API first and then extend it to the automatic mode (watchdog).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132291
Approved by: https://github.com/eqy
Enable FSDP to deal with channels_last memory-formatted tensors. Preserving the channels_last memory format makes FSDP compatible with the best kernels cuDNN offers.
Summary of changes:
1) Store strides information along with shapes
2) Replace calls to flatten() with as_strided(size=(param.numel(),), stride=(1,)) for flattening
3) Replace calls to view() with as_strided using the stored sizes and strides for unflattening (a minimal sketch of this flatten/unflatten round trip follows below)
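A minimal sketch of the flatten/unflatten round trip from items 2) and 3), assuming a plain channels_last parameter (illustrative only, not the FSDP implementation):
```python
import torch

param = torch.randn(8, 3, 4, 4).to(memory_format=torch.channels_last)
orig_size, orig_stride = param.size(), param.stride()

# Flatten with as_strided instead of flatten(), so no contiguous (NCHW) copy is forced.
flat = param.as_strided(size=(param.numel(),), stride=(1,))

# Unflatten with as_strided using the stored sizes and strides instead of view().
restored = flat.as_strided(size=orig_size, stride=orig_stride)

assert restored.is_contiguous(memory_format=torch.channels_last)
assert torch.equal(restored, param)
```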
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137382
Approved by: https://github.com/awgu
Resolves #137540
Summary:
We might get different state_dict and named_parameters result when the module has registered custom state_dict_hooks.
For exported_program's state_dict, we want the state_dict to reflect the actual module hierarchy at runtime, and it might be different from the model's state_dict() output if the model has state_dict hooks.
To do weight swapping, one needs to either re-export or turn-off the hooks when saving model's state_dict().
Previously, ExportedProgram uses nn.Module's state_dict() method to populate its own state_dict, but it doesn't work for some models (e.g. llama3_3_vision) because ExportedProgram's state_dict and an nn.Module's state_dict have some subtle differences semantically.
nn.Module's state_dict is about how the state should be serialized, and it reflects the structure of the original user model code. In contrast, export specializes on a “run” of a model, and its state_dict needs to reflect the runtime module hierarchy.
One example where these two are different is TorchTune's Llama3_2_vision text decoder. Here, a FusionLayer is added as a local optimization and it is not part of the "static model definition". In runtime, we have mod.layers[3].layer.sa_norm.scale.
But in nn.Module's state_dict, the authors of the model added a state_dict hook to remove the "layer" in mod.state_dict() to reflect the static model definition, so we have mod.state_dict()["layers.3.sa_norm.scale"].
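A minimal sketch (illustrative, not the TorchTune model) of how such a state_dict hook, registered via the private `_register_state_dict_hook` API, makes `nn.Module.state_dict()` keys diverge from `named_parameters()` keys:
```python
import torch.nn as nn

class Inner(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

class Wrapper(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = Inner()
        # Hook that hides the "layer." level so state_dict() matches a "static" model definition.
        self._register_state_dict_hook(self._strip_layer)

    @staticmethod
    def _strip_layer(module, state_dict, prefix, local_metadata):
        # Rewrite keys in place: "layer.linear.weight" -> "linear.weight".
        for key in list(state_dict.keys()):
            if key.startswith(prefix + "layer."):
                state_dict[prefix + key[len(prefix + "layer."):]] = state_dict.pop(key)

    def forward(self, x):
        return self.layer.linear(x)

m = Wrapper()
print(sorted(m.state_dict().keys()))               # ['linear.bias', 'linear.weight']
print(sorted(n for n, _ in m.named_parameters()))  # ['layer.linear.bias', 'layer.linear.weight']
```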
In this Diff, we change ExportedProgram to populate its state_dict using named_parameters() and named_buffers() instead. So in ExportedProgram's state_dict, we have "layers.3.layer.sa_norm.scale", which reflects the runtime module hierarchy.
Now one problem this presents is weight swapping. Since ExportedProgram's state and the model's state is not the same anymore, weight swapping procedure also needs to change slightly.
In internal Ads and RecSys model deployments, weight swapping is where they have one model that is currently deployed and serving traffic, and they want to swap out the weights with newly trained model weights without having to redo the whole exporting/lowering process and create a new artifact. So they would move the deployed model’s pointer to the state dict over to the new state dict. Because of this, it was previously a requirement that the FQNs match between the exported and the eager model’s state dict.
The new ExportedProgram's state dict still supports weight swapping, but the state_dict to be swapped needs to be obtained from torch.export.exported_program instead of model.state_dict() if the model has state_dict hooks.
The new requirement is that the FQNs are matching between the exported’s state dict and the state_dict obtained from `_disabled_load_state_dict_hooks(M)` context manager. One benefit of having this new API is that we are now in full control within export of gathering and updating the model state.
If a model doesn't have any state_dict hooks, one can still use model.state_dict() for weight swapping, so it's BC.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_export_for_training_with_state_dict_hooks
```
Differential Revision: D64080561
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137609
Approved by: https://github.com/angelayi, https://github.com/pianpwk
Move optimization from the export call to the `optimize()` method in ONNXProgram.
Users can call `optimize()` before calling `save()` to save the model. Right now if users set `optimize=True` in `torch.onnx.export` it will have the same effect as calling `optimize()`, but in the future we can evolve the method to be more flexible (e.g. target aware etc.)
Example
```python
onnx_program = torch.onnx.export(..., dynamo=True)
onnx_program.optimize()
onnx_program.save("model.onnx")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137667
Approved by: https://github.com/titaiwangms
ghstack dependencies: #137666
Related to #107302
The following test fails with NumPy 2.
```
_________ TestNumPyInteropCPU.test_numpy_array_interface_cpu __________
Traceback (most recent call last):
File "/usr/local/google/home/haifengj/git/pytorch_np2/test/test_numpy_interop.py", line 415, in test_numpy_array_interface
wrapped_x = np.array([1, -2, 3, -4], dtype=dtype)
OverflowError: Python integer -2 out of bounds for uint8
To execute this test, run the following from the base repo dir:
python test/test_numpy_interop.py TestNumPyInteropCPU.test_numpy_array_interface_cpu
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
According to the official warning from NumPy 1, assigning a negative value to a `uint8` is deprecated.
The recommended way is `np.array([1, -2, 3, -4]).astype(np.uint8)`.
See the following for details.
```
>>> np.array([1, -2, 3, -4], dtype=np.uint8)
<stdin>:1: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays. The conversion of -2 to uint8 will fail in the future.
For the old behavior, usually:
np.array(value).astype(dtype)
will give the desired result (the cast overflows).
<stdin>:1: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays. The conversion of -4 to uint8 will fail in the future.
For the old behavior, usually:
np.array(value).astype(dtype)
will give the desired result (the cast overflows).
array([ 1, 254, 3, 252], dtype=uint8)
```
This PR fixes the test failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137532
Approved by: https://github.com/soulitzer
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:
## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.
## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5) <-- no additional context yet
del work <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)
When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.
## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.
`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
ghstack dependencies: #137161
### Context
Fixes CUDA IMA in autotune_at_compile_time, where we would generate an example tensor with an incorrect stride.
In the case below, the stride should be (u0 * 128, 128, 1). However, we apply the fallback on the entire expr (i.e. u0 * 128).
```
# buf817 = tensor(size=(s0, u0, 128), stride=(u0 * 128, 128, 1))
buf812 = generate_example_value(
(64, 8192, 128), (8192, 128, 1), "cuda:0", torch.bfloat16, 0
)
```
The fix is to apply the fallback on each symbol.
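A small sketch of that idea with hypothetical names (the fallback value of 8192 matches the example above; the real logic lives in the autotune-at-compile-time codegen):
```python
import sympy

u0 = sympy.Symbol("u0", positive=True, integer=True)
stride_expr = u0 * 128          # symbolic stride of dim 0 in the example above
UNBACKED_FALLBACK = 8192        # hypothetical hint used for unbacked symbols

# Buggy behavior: the fallback replaces the entire expression.
wrong_stride = UNBACKED_FALLBACK                               # 8192

# Fix: substitute the fallback for each symbol, then evaluate the expression.
fixed_stride = int(stride_expr.subs({u0: UNBACKED_FALLBACK}))  # 8192 * 128 = 1048576

print(wrong_stride, fixed_stride)
```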
### Test
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer python test_aot_inductor.py -k test_stride_with_unbacked_expr_abi_compatible_cuda
========= Invalid __global__ write of size 2 bytes
```
Differential Revision: [D64074561](https://our.internmc.facebook.com/intern/diff/D64074561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137537
Approved by: https://github.com/jingsh
Currently, there are compilation warnings as shown below; they are resolved after the fix:
```
/tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp: In function ‘ihipModuleSymbol_t* loadKernel(std::string, const string&, uint32_t, const std::optional<std::__cxx11::basic_string<char> >&)’:
/tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:482:25: warning: ignoring returned value of type ‘hipError_t’, declared with attribute nodiscard [-Wunused-result]
482 | hipDrvGetErrorString(code, &msg); \
| ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~
/tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:519:5: note: in expansion of macro ‘CUDA_DRIVER_CHECK’
519 | CUDA_DRIVER_CHECK(hipModuleLoad(&mod, filePath.c_str()));
| ^~~~~~~~~~~~~~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:70,
from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/device_utils.h:14,
from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:17,
from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:13,
from /tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:4:
/opt/rocm/include/hip/hip_runtime_api.h:2369:12: note: in call to ‘hipError_t hipDrvGetErrorString(hipError_t, const char**)’, declared here
2369 | hipError_t hipDrvGetErrorString(hipError_t hipError, const char** errorString);
| ^~~~~~~~~~~~~~~~~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:70,
from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/device_utils.h:14,
from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:17,
from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:13,
from /tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:4:
/opt/rocm/include/hip/hip_runtime_api.h:399:3: note: ‘hipError_t’ declared here
399 | } hipError_t;
| ^~~~~~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137626
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
**Motivation**
Enable SVE vectorization with `torch.compile`
Extends PR: #119571
* This PR enables vectorization for codegen part using SVE-256 (vec length)
* The changes can be extended to other SVE vec lengths
I've done some comparisons of the existing NEON implementation against the SVE-vectorization-enabled route for `torch.compile`
Test results are for 8 cores on ARM Neoverse_V1
<img width="359" alt="Screenshot 2024-08-28 at 16 02 07" src="https://github.com/user-attachments/assets/6961fbea-8285-4ca3-b92e-934a2db50ee2">
It's worth mentioning that for the standalone `SiLU` op there's a `~1.8x` speedup with `torch.compile`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134672
Approved by: https://github.com/jgong5, https://github.com/malfet
The change in pytorch/pytorch#136785 enabled these jobs to run on LF runners; however, we saw a sudden large spike in cost once that happened last week, which would have caused us to overuse our available AWS credits. This change hard-locks the tests for these jobs to Meta runners. We need this at least until we can figure out how to handle the additional spend caused by these jobs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137616
Approved by: https://github.com/Skylion007, https://github.com/seemethere
Summary: We hipify NCCLUtils.h from nccl.h to rccl/rccl.h. This follows the layout of the ROCm rpm suite (the header is in include/rccl/rccl.h); however, in the source tree the header is just src/rccl.h. Using rccl/rccl.h makes us find the rpm's header but not the source tree's header.
Test Plan:
buck run mode/opt-amd-gpu -c hpc_comms.use_rccl=develop -c fbcode.split-dwarf=True --config rccl.build_rdma_core=true --config rccl.adhoc_brcm=true //aps_models/ads/icvr:icvr_launcher -- mode=local_ctr_cvr_cmf_rep_1000x_v1_no_atom data_loader.dataset.table_ds=[2024-09-04] data_loader.dataset.batch_size=512 max_ind_range=10
w/o this diff, it'll show 2.18 nccl version
Differential Revision: D62371434
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135472
Approved by: https://github.com/jeffdaily, https://github.com/cenzhaometa
Summary: Fixes a couple of problems: constants didn't have metadata before creating graph signatures, and graph signatures weren't updated when lifting constants.
Test Plan: fixed test
Differential Revision: D64081786
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137547
Approved by: https://github.com/tugsbayasgalan
**Summary**
Previously, we assumed the packed weight for (`MKL/MKLDNN`) linear operations was at `new_input_nodes[1]`. However, this is not the case for `MKL linear`, where `new_input_nodes[1]` contains the original weight instead of the packed weight. To generalize the code, in this PR, we identify nodes that are present in `input_nodes` but not in `new_input_nodes`—indicating they are no longer used by the GEMM template and can be considered candidates for deletion.
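A tiny sketch of that selection with illustrative names (not the actual GEMM template code):
```python
def deletion_candidates(input_nodes, new_input_nodes):
    # Nodes present in input_nodes but absent from new_input_nodes are no longer
    # used by the GEMM template and become candidates for deletion.
    new_set = set(new_input_nodes)
    return [node for node in input_nodes if node not in new_set]

print(deletion_candidates(["x", "w_original", "bias"], ["x", "w_packed", "bias"]))
# ['w_original']
```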
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135101
Approved by: https://github.com/jgong5, https://github.com/jansel
It seems that there's a bug in `TensorMaker` - it treats `storage_offset` as bytes when calculating the storage size, but as an element offset when setting the tensor's `storage_offset`. This seems to be causing tensors returned by get_buffer() with a non-0 offset to report the wrong storage size.
Will look into the `TensorMaker` issue further. But for `get_buffer()`, it seems more natural to just incorporate the offset into the data pointer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137569
Approved by: https://github.com/weifengpy
ghstack dependencies: #137567
During auto_functionalize_v2, if we encounter a view whose size(), stride(), and storage_offset() match its base,
we create a view that is regenerated by calling aten.alias instead of as_strided, for better performance.
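A minimal sketch of why this is safe: when a view shares size(), stride(), and storage_offset() with its base, aten.alias reproduces the same view that as_strided would, without the extra metadata work (illustrative only):
```python
import torch

base = torch.randn(4, 8)
view = base[:]  # same size/stride/storage_offset as the base

via_alias = torch.ops.aten.alias(base)
via_as_strided = torch.ops.aten.as_strided(
    base, view.size(), view.stride(), view.storage_offset()
)

assert via_alias.shape == via_as_strided.shape
assert via_alias.stride() == via_as_strided.stride()
assert via_alias.data_ptr() == via_as_strided.data_ptr()
```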
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137149
Approved by: https://github.com/zou3519
Remove the `static` keyword from the vec-register constants in the exp_u20 implementation. With the bf16 input shape of BertLarge, the SDPA kernel improves from 5.1ms to 4.7ms on SPR with 56 threads.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137571
Approved by: https://github.com/jgong5
Fixes #115725. Note that the GitHub issue title is misleading; read the comments to understand what the problem is really about.
The PR improves the documentation and CMake's behavior for ROCM builds.
- Documentation: There were two environment variables for ROCm builds that are now documented. `ROCM_PATH` and `PYTORCH_ROCM_ARCH`.
- CMake: Improved diagnostic messaging and error handling with respect to `ROCM_PATH`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137308
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/jeffdaily
The test also adds a signpost log for the benchmarks that pass.
To test, I ran `python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv out.csv`.
Results:
```
WIN: benchmark ('a', 'instruction count') failed, actual result 90 is -18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', 'memory') failed, actual result 200 is 100.00% higher than expected 100 ±+10.00% if this is an expected regression, please update the expected results.
PASS: benchmark ('c', 'something') pass, actual result 107 +7.00% is within expected 100 ±10.00%
MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it.
You can use the new reference expected result stored at path: out.csv.
a,instruction count,90,0.01
b,memory,200,0.1
c,something,100,0.1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137551
Approved by: https://github.com/aorenste
When non-blocking mode is enabled, we need to make sure `ncclComm_` is ready before calling NCCL APIs on it.
`NCCLComm::getNcclComm` helps us do that (thanks to a wait function inside) and is thus preferred over directly using `ncclComm_`.
To prevent `ncclComm_` from being directly used outside, e.g. in `ProcessGroupNCCL`, we also move it as a private member of `NCCLComm` class -- the external-facing wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137573
Approved by: https://github.com/Skylion007, https://github.com/shuqiangzhang, https://github.com/c-p-i-o
ghstack dependencies: #137572
Summary:
Barrier is essentially intended to block the CPU thread (instead of GPU streams). Before, we used two stream synchronizations (1. the current stream is blocked by the nccl stream end event, 2. the CPU thread is blocked on the current stream). This is unnecessary, as we already have CPU-thread-blocking logic in wait(). Also, adding a barrier-specific code block in the general GPU synchronize() API is intrusive and confusing.
This PR cleans this up.
Test Plan:
CI
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137516
Approved by: https://github.com/fduwjj, https://github.com/kwen2501
- Previously, the detection would fail when run before the user calls APIs such as `torch.cuda.set_device()`. This is because the detection logic requires nvml initialization. In this PR, we add explicit nvml initialization (which is idempotent).
- Previously, any nvml issue occurring in the detection logic would result in a fatal error. Now we issue an informative warning and return a topology assuming no NVLink connectivity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137530
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475, #137529
## This Stack
Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.
## This PR
- Replaces one-shot all-reduce with `symm_mem::one_shot_all_reduce_out`
- Replaces two-shot all-reduce with `symm_mem::two_shot_all_reduce_`
- Removes HCM all-reduce (at least for now). Due to the nature of its accumulation order, we can't guarantee the numerical consistency across all ranks.
- Removes the `IntraNodeComm` python binding (its original purpose is superseded by `SymmetricMemory`).
- Removes methods that were made for the python binding.
- Replaces nvlink detection logic with `DMAConnectivityDetector`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137475
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474
## This Stack
Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.
## This PR
Implement `symm_mem::multimem_one_shot_all_reduce_out`. The out-variant is more suitable for `IntraNodeComm` integration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137474
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473
Before this change, tests for operators like `eye` or `triu_indices` were essentially a test that the respective CPU operators are stable, as cpu_sample and mps_sample were the same
Moved the logic to `transform_opinfo_sample_to_mps`, which in addition to copying tensors also tweaks `kwargs`
Discovered that:
- `torch.randn` and `torch.randint` fall into the same undefined category
- `torch.logspace` is not implemented for MPS
- Allow 1.0 absolute tolerance for all `torch.linspace` calls over integral input as rounding is wrong on the MPS side
- `torch.triu_indices` are not implemented (PR is coming, this is how I've discovered this problem)
- `torch.signal.windows.kaiser` fails because `aten::i0` is not implemented
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137601
Approved by: https://github.com/albanD
This is a utility to aid torch.compile debugging. You provide a function that returns True on success and False on failure, or you do something out of process and run `bisect_helper good | bad`.
The bisector will first go through backends - `eager`, `aot_eager`, `aot_eager_decomp_partition`, `inductor` to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it.
An example usage of how to hook it up for aot_eager_decomp_partition and decomposition subsystem is :
```
from torch._inductor.bisect_helper import BisectionManager

if op in CURRENT_DECOMPOSITION_TABLE:
    if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)):
        return NotImplemented
```
Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND` `TORCH_BISECT_SUBSYSTEM` and `TORCH_BISECT_MAX`.
We could add further options as an automated way of going through a check list for checking divergence - e.g., the mode to emulate amp casts.
Fix for https://github.com/pytorch/pytorch/issues/126546
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131936
Approved by: https://github.com/ezyang
Related: https://github.com/pytorch/xla/issues/7799#issuecomment-2375818263
Follow ups: Do the same for maia and mtia
## Motivation
With the move to `weights_only` by default, we are making an explicit decision not to allowlist GLOBALs required to deserialize `numpy` tensors by default. The implication is that backends relying on numpy for serialization will fail loudly when `torch.load` flips `weights_only`.
However, we make the observation that this dependency on numpy was legacy and is not actually needed anymore. So we can remove it, which aligns with our weights_only strategy.
## Why is this ok?
The following comment on why numpy is necessary for serialization is legacy
c87c9f0a01/torch/_tensor.py (L303-L312)
We no longer do the following, though it was the case 5 years ago in the PR that added this
> CPU storage is reconstructed with randomly initialized data, moved onto backend device, and then storage is updated to the serialized content
Instead, what now happens is that CPU storage is constructed with data from the file **and then** moved onto the backend device.
Old behavior (`legacy_load`): 67adda891a/torch/serialization.py (L620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137444
Approved by: https://github.com/albanD
## Overview
This PR adds a `shard_placement_fn: Optional[Callable[[nn.Parameter], Optional[Shard]]` arg to `fully_shard` that allows users to specify FSDP sharding on a nonzero tensor dim. If doing so, then the tensor dim size must be divisible by the FSDP shard world size.
```
# Example:
def shard_placement_fn(param: nn.Parameter) -> Optional[Shard]:
    largest_dim = largest_dim_size = -1
    for dim, dim_size in enumerate(param.shape):
        if dim_size > largest_dim_size:
            largest_dim = dim
            largest_dim_size = dim_size
    return Shard(largest_dim)

fully_shard(module, shard_placement_fn=shard_placement_fn)
```
## Follow-Ups
- **Copy kernels:** For all-gather copy-out, we currently copy-out to temporaries and then chunk-dim-0 -> cat-shard-dim, incurring an extra copy for parameters sharded on nonzero tensor dim. Similarly, for reduce-scatter copy-in, we currently chunk-shard-dim -> cat-dim-0, incurring an extra copy for gradients sharded on nonzero tensor dim. @yifuwang has ideas for adding additional split size args to the copy ops that allows fusing these extra copies into the existing all-gather copy-out and reduce-scatter copy-in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137496
Approved by: https://github.com/weifengpy
ghstack dependencies: #137593
Summary:
NCCL 2.23.4 provides the profiler plugin feature, which traces collective, p2p, proxyOps, and other events.
The diff supports the following feature: when NCCL times out, the flight recorder can also dump traces in the profiler plugin.
Test Plan:
```
tensor = torch.tensor([dist.get_rank()], dtype=torch.int32, device=dev)
# Create a list with same number of elements as world size (aka no. of ranks)
# During allgather this list is going to be populated with tensors from all ranks (aka all gather)
gathered_tensors = [torch.zeros_like(tensor) for _ in range(WORLD_SIZE)]
# get collective from all ranks
if i <= 10 or RANK != 0:
    dist.all_gather(gathered_tensors, tensor)
```
My script triggers flight recoder.
```
trainer/0 [0]:E0927 12:07:22.643702 1012209 ProcessGroupNCCL.cpp:1356] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL preparing to dump debug info.
trainer/0 [0]:I0927 12:07:22.643784 1012209 ProcessGroupNCCL.cpp:392] NCCL_PROFILER_PLUGIN: /data/users/zhiyongww/fbsource/fbcode/scripts/nbahl/libnccl_profiler_plugin.so
trainer/0 [0]:I0927 12:07:22.643805 1012209 plugin.cpp:559] Profiler start dump
trainer/0 [0]:I0927 12:07:22.645249 1012209 ProcessGroupNCCL.cpp:1363] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL dumping nccl trace to /tmp/nccl_trace_rank_0
trainer/0 [0]:I0927 12:07:22.645418 1012209 NCCLUtils.cpp:348] Finished writing NCCLPG debug info to /tmp/nccl_trace_rank_0
```
Content from /tmp/nccl_trace_rank_0: P1614645283
Differential Revision: D61929401
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137523
Approved by: https://github.com/c-p-i-o
See `test_inline_closure_returned_by_another_function_and_captures` and #136814 for more context.
In #90286, we introduced an optimization so that for captured cells that are unmodified during a Dynamo trace, `UserFunctionVariable` will represent them as variable of the cell's actual value, rather than a `NewCellVariable`.
Later on we introduced more mechanisms to model such cells across function calls (#104222), and across function calls where `NestedUserFunctionVariable::bind_args` need to look up further in the parent frames (#106491) to find these cells' values.
This patch removes `InlinedClosureVariable` in favor of a simpler modelling, which is also more consistent with what was introduced in #90286, i.e., just model these cells as their contents, in `symbolic_locals`.
This fixes #136814 because resolution of `InlinedClosureVariable` to the underlying cell content value happens in
`NestedUserFunctionVariable::bind_args`, which requires Dynamo to have the value in scope at the function call site (when Dynamo does inlining), but that's not always the case (as the test case shows). However, if we model the cells in `symbolic_locals`, we never need such resolution, and the values are directly stored into `NestedUserFunctionVariable::closure` upon function creation, at which point Dynamo always has the cell value in `symbolic_locals` to look up.
Fixes #136814.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137510
Approved by: https://github.com/williamwen42
Summary:
# context
* enable the `_get_user_embeddings` function
* run failed at P1610151892
```
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
GuardOnDataDependentSymNode: Could not guard on data-dependent expression u22 <= 0 (unhinted: u22 <= 0). (Size-like symbols: u22)
ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False.
Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance.
Potential framework code culprit (scroll up for full backtrace):
File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/38472faba4e3e6c1/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_decomp/decompositions.py", line 1692, in native_layer_norm_backward
if M <= 0 or N <= 0:
```
```
N = prod(inner_dims)  # type: ignore[arg-type]
M = prod(outer_dims)  # type: ignore[arg-type]
if M <= 0 or N <= 0:
    return (
        input.new_zeros(input_shape) if output_mask[0] else None,
        input.new_zeros(input_shape[axis:]) if output_mask[1] else None,
        input.new_zeros(input_shape[axis:]) if output_mask[2] else None,
    )
```
# changes
* Use guard_size_oblivious, since the new_zeros return is a kind of optimization and shouldn't impact the correctness of the follow-up code logic.
* The size `ret[i][j]` could be zero, so the change in V1 isn't valid.
* for more details: [post](https://fb.workplace.com/groups/6829516587176185/permalink/8003616173099548/)
```
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious
if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):
```
# past
* found `u22` was introduced at
```
def _wait_impl(self) -> List[List[int]]:
    # Can not use is_torchdynamo_compiling(), as every such condition should be independent for compilation with graph breaks.
    if isinstance(self._splits_awaitable, dist.Work):
        self._splits_awaitable.wait()
    ret = self._output_tensor.view(self.num_workers, -1).T.tolist()  # <------ u22 introduced here
    if not torch.jit.is_scripting() and is_torchdynamo_compiling():
        for i in range(len(ret)):
            for j in range(len(ret[i])):
                torch._check_is_size(ret[i][j])  # <---------- my question: why the _check_is_size isn't enough??
                torch._check(ret[i][j] > 0)  # <------ added by diff V1
```
Test Plan:
# run command
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2 2>&1 | tee -a `tagT`.`tagH`.log
```
# results
* before
**without enabling `_get_user_embeddings`**
[14 Failures and Restarts](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp2eNI7p/failures_and_restarts.html)
log: P1610151892
{F1889387940}
* V1
enable `_get_user_embeddings`
with `torch._check(ret[i][j] > 0)`
[13 Failures and Restarts](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp6J1iY9/failures_and_restarts.html)
{F1889388378}
* V2
enable `_get_user_embeddings`
with `if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):`
[tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpFhZZyC/index.html)
Differential Revision: D63424929
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136798
Approved by: https://github.com/ezyang
Summary: Previously, triton_reshape would generate code with the `Min` keyword in it, which is incorrect. This diff updates the triton_reshape function to properly expand the `Min` keyword to `<`.
Test Plan:
```
buck2 run @//mode/{opt,mtia,inplace} //glow/fb/fx/fba/tests:test_fba_inductor -- -r test_Min_keyword_in_block_shape
```
Differential Revision: D63850158
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137357
Approved by: https://github.com/blaine-rister, https://github.com/eellison
This PR doesn't change the logic of `test_graph_input_is_async` - it just adds an additional check to the graph input type to ensure it's always `AsyncCollectiveTensor` as expected. It would potentially make it easier to show to users that we already support `AsyncCollectiveTensor` as graph input.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137253
Approved by: https://github.com/bdhirsh
Fixes#137422
Add a parameter type definition in the API docs to clarify the allowed value types and keep users from passing `None` as the `dim` value directly.
```python
>>> import torch
>>> x = torch.randn(3,1,2)
>>> x.squeeze(dim=None)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: Please look up dimensions by name, got: name = None.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137485
Approved by: https://github.com/albanD
The test runs all of its combinations (512) sequentially, so it takes more than 30 minutes to finish, or times out on ASAN after one hour. Parametrizing it breaks it up, so individual tests can finish and don't need to be marked as slow anymore.
Also, the test seems to run OOM on a 2xlarge with a `std::bad_alloc` memory error. Maybe this would also fix that issue (pending CI testing)
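A rough sketch, with illustrative test names, of the kind of parametrization used in PyTorch tests so that each combination runs (and can time out or be skipped) independently:
```python
from torch.testing._internal.common_utils import (
    TestCase,
    instantiate_parametrized_tests,
    parametrize,
    run_tests,
)

class MyParametrizedTest(TestCase):
    @parametrize("mode", ["a", "b"])
    @parametrize("size", [1, 64, 512])
    def test_combo(self, mode, size):
        # Each (mode, size) pair becomes its own generated test case.
        self.assertTrue(size > 0 and mode in ("a", "b"))

instantiate_parametrized_tests(MyParametrizedTest)

if __name__ == "__main__":
    run_tests()
```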
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447
Approved by: https://github.com/albanD, https://github.com/malfet
## This Stack
Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.
## This PR
Implement `symm_mem::two_shot_all_reduce_`. Later we'll replace the two-shot all-reduce in `IntraNodeComm` with these.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137473
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472
## This Stack
Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.
## This PR
Implement `symm_mem::one_shot_all_reduce` and `symm_mem::one_shot_all_reduce_out`. Later we'll replace the one-shot all-reduce in `IntraNodeComm` with these.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137472
Approved by: https://github.com/Chillee, https://github.com/weifengpy
ghstack dependencies: #137471
## This Stack
Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.
## This PR
Refine the cross-device synchronization primitives to make it clearer when to use which synchronization pattern.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137471
Approved by: https://github.com/Chillee, https://github.com/weifengpy
Moves `TransformGetItemToIndex` to a file where dynamo stores other traceable HOP concepts. (We don't trace through torch.* modules by default)
Tracing through the mode required fixing a bug in dynamo autograd function, which fixed a graph break, which caused the autograd test failures (skipping for now and will file an issue)
Previously those tests were in essence running in eager mode, because dynamo would fall back due to an arg mismatch error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137120
Approved by: https://github.com/yanboliang, https://github.com/malfet
ghstack dependencies: #137114, #137115, #137116, #137117
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)
Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call
resume fn structure:
1. enter context
2. jump
...
3. exit context
The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).
So for torch function modes the structure of our output code is this:
1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function
Then our resume fn looks like this:
1. no-op enter torch function mode
2. jump
3. exit tf mode
To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).
Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly.
However, once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects, because we want to preserve the state of the mode stack as is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode.
All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137114
Approved by: https://github.com/yanboliang
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/135106. For AOTI, the Inductor IR of the weight is
```
ReinterpretView(
  StorageBox(
    ConstantBuffer(name='L__self___mlp_0_weight', layout=FixedLayout('cpu', torch.float32, size=[64, 128], stride=[128, 1]))
  ),
  FixedLayout('cpu', torch.float32, size=[128, 64], stride=[1, 128]),
  origins=OrderedSet([addmm])
)
```
In the post-processing step of the GEMM template, the weight used was the one from before the permutation, leading to correctness issues. In this PR, we address this by reshaping the weight to the expected size and stride before the weight prepack.
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_aot_inductor.py -k test_misc_1_max_autotune_True_non_abi_compatible_cpu
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_aoti_linear
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_aoti_linear_multi_view_operations
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136421
Approved by: https://github.com/jgong5, https://github.com/desertfire
This PR is for supporting calling `parallelize_module` from within a model definition, making the model a parallel one.
Calling `parallelize_module` is an alternative to maintaining a set of `ColumnWiseLinear`, `RowWiseLinear`, etc, while still being able to directly author a parallel model.
(The motivation for authoring a parallel model is that there may be other distributed operations, which may not be easily captured by any module, see the forward function below. Alternatively speaking, the purpose is to exploit the expressiveness of DTensor -- we need to first create DTensors before calling ops on them. Having parallelized modules in model is one way of creating DTensors.)
For example:
```
class FeedForward(nn.Module):
    def __init__(self, config: TransformerArgs) -> None:
        super().__init__()
        w1 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        w2 = nn.Linear(config.hidden_dim, config.dim, bias=False)
        w3 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        self.w1 = parallelize_module(w1, Colwise)
        self.w2 = parallelize_module(w2, Rowwise)
        self.w3 = parallelize_module(w3, Colwise)

    def forward(self, x: Tensor) -> Tensor:
        y: DTensor = self.w2(F.silu(self.w1(x)) * self.w3(x))
        # y is a DTensor with Partial placement; we can return it as is.
        return y
        # Or we can convert it to Replicate -- there is modeling flexibility here.
        return y.redistribute(Replicate())

with device_mesh:
    model = FeedForward(config)
    # Now model is a model parallelized onto device_mesh
    y = model(x)
```
The `device_mesh` actually used for `parallelize_module` would be retrieved from the ambient context.
Calling `parallelize_module` from within the model hierarchy also avoids the use of *FQNs* that the out-of-model annotation case requires.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134247
Approved by: https://github.com/tianyu-l
When we are autotuning matmuls the aten.mm and the triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. So long as the output shape and stride of the matmul matches the slice of the cat we're planning, we can realize the mm directly into the cat.
Discussion for reviewers:
It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](bcac71517c/torch/_inductor/kernel/mm.py (L156)). While this is correct, it might lead to passing non-performant output strides to cublas. I guess this is better than a copy? Not sure. We could also introduce a Layout that denotes a fixed shape and stride whose allocation we control:
```
class AllocatedFixedLayout(FixedLayout)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554
Approved by: https://github.com/jansel
For `autograd.Function`, the engine will try to allocate correctly-shaped zeros for `None` grads (i.e. in the case where the output isn't used downstream). It determines the shape of these zeros from the `VariableInfo` entry, which is derived from the forward output shape. For the NJT forward output case, the size info stored will contain a nested int, and calling `zeros()` with this size throws:
```
RuntimeError: .../build/aten/src/ATen/RegisterCPU.cpp:5260: SymIntArrayRef expected to contain only concrete integers
```
This PR fixes this by storing the full tensor in the `VariableInfo` for the nested case and calling `zeros_like()` to allocate correctly-shaped zeros. This is pretty inefficient; ideally we would want to save just the NJT shape and be able to construct zeros from it, but this requires factory function support for nested ints (WIP). So this is a short-term fix until we have that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136875
Approved by: https://github.com/soulitzer
compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because:
(1) it consists of a single input + many weights that are used sequentially
(2) contains a mix of recompute vs non-recomputed ops (matmul + sin)
(3) it is relatively simple
from running locally:
```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760
Approved by: https://github.com/ezyang
ghstack dependencies: #136759
this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:
(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths
Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)
I ran locally, and got these numbers on the 4 paths:
```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759
Approved by: https://github.com/laithsakka
Instead, call back to a missing handler when needed. This greatly speeds things up when the value ranges dict is large. The missing handler is needed because nested ints don't have VRs, but symbolic sizes involving them occasionally show up in compute.
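A generic sketch of that pattern with illustrative names (not the actual ShapeEnv code): look symbols up lazily and fall back to a handler, e.g. an unbounded range, for symbols such as nested ints that have no recorded range.
```python
UNBOUNDED = (float("-inf"), float("inf"))

def get_value_range(value_ranges, sym, missing_handler=lambda s: UNBOUNDED):
    # Avoid pre-populating defaults for every symbol in a large dict.
    try:
        return value_ranges[sym]
    except KeyError:
        return missing_handler(sym)

ranges = {"s0": (1, 128)}
print(get_value_range(ranges, "s0"))   # (1, 128)
print(get_value_range(ranges, "u22"))  # (-inf, inf) via the missing handler
```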
```
TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="s11" TORCH_LOGS=dynamic PYTORCH_TEST_WITH_DYNAMO=1 python test/test_nestedtensor.py TestNestedTensorAutogradCPU.test_dropout_backward_jagged_cpu
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136667
Approved by: https://github.com/isuruf
ghstack dependencies: #136429
Partially addresses https://github.com/pytorch/pytorch/issues/128150
When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation. Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments. Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.
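A rough illustration of the cost difference (not the PyTorch change itself): each chained `+` rebuilds a sympy.Add over all accumulated arguments, whereas a single n-ary construction does the work once.
```python
import sympy

terms = sympy.symbols("s0:200")

# Chained binary addition: N constructor calls, each over the growing argument list.
chained = terms[0]
for t in terms[1:]:
    chained = chained + t

# Single n-ary construction: one constructor call over all terms.
single = sympy.Add(*terms)

assert chained == single
```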
update_hint_regression benchmark, before and after:
```
update_hint_regression,compile_time_instruction_count,2648328980
update_hint_regression,compile_time_instruction_count,2563748678
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136429
Approved by: https://github.com/isuruf
Summary: There are two instances of AppendOnlyLists that don't get cleared after we have finished iterating through the forward lists. This can be potentially dangerous since they can last for the entirety of the lifespan of the profiler. We have also seen crashes during the destructor of these variables when the profiler is exiting. This could possibly be related to the fact that the default constructor assumes some valid state of these lists rather than whatever state they are in when profiler is exiting.
Test Plan: Ran with profile_memory=True to make sure allocations queue gets cleared correctly and trace+workload ran successfully
Differential Revision: D64010911
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137450
Approved by: https://github.com/aaronenyeshi
Works around #136543.
This fix solves the issue only in the context of the ONNX exporter, but the issue happens in other contexts as well.
The bug happens when method `run_decompositions` is called. The failing pattern is assumed to be ``view(transpose(x, ...))``. This pattern is replaced by ``view(flatten(transpose(x, ..)))``. By changing the dimensions, the strides are updated as well and `run_decompositions` does not fail anymore. It would be inefficient on a 1D tensor but then transpose would not be used. The extra node appears in the final onnx graph but is removed after optimization. The final onnx graph should not be impacted and no performance loss should be observed for the onnx model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137340
Approved by: https://github.com/justinchuby
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
The test runs all its combinations (512) sequentially, so it takes more than 30 minutes to finish or times out on ASAN after one hour. Parametrizing it will break it up, so individual tests can finish and don't need to be marked as slow anymore.
Also, the test seems to run out of memory on a 2xlarge with a `std::bad_alloc` error. Maybe this change would also fix that issue (pending CI testing).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447
Approved by: https://github.com/albanD, https://github.com/malfet
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.
Also, remove the need for a private synchronization function.
In the longer term, we would like to give users the flexibility to bypass the watchdog thread and handle collective errors themselves.
Test Plan:
python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137295
Approved by: https://github.com/kwen2501
Summary: It seems like the import path is different between FBCode & OSS. Wondering how to consolidate them.
Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cutlass_backend
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 33. Build failure 0
```
Differential Revision: D63991961
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137435
Approved by: https://github.com/jovianjaison
When we are autotuning matmuls, the aten.mm and the triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. So long as the output shape and stride of the matmul match the slice of the cat we're planning, we can realize the mm directly into the cat.
Discussion for reviewers:
It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](bcac71517c/torch/_inductor/kernel/mm.py (L156)). While this is correct, it might lead to passing non-performant output strides to cuBLAS. I guess this is better than a copy? Not sure. We could also introduce a Layout that denotes a fixed shape and stride whose allocation we control:
```
class AllocatedFixedLayout(FixedLayout)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554
Approved by: https://github.com/jansel
This PR fixes a bug in `search_end_matrix_indices_cuda_kernel` causing an illegal memory access when calling `bmm_sparse_cuda` on a sparse matrix with no non-zero values in the first batch dimension. Reproducible example:
```py
import torch
ind = torch.tensor([[1], [0], [0]], device="cuda")
val = torch.tensor([1.], device="cuda")
A = torch.sparse_coo_tensor(ind, val, size=(2, 1, 1))
B = torch.zeros((2, 1, 1), device="cuda")
C = torch.bmm(A, B)
```
## Details
In the previous code, we may for example end up with the following situation:
```
i : indices_1D[i]
------------------------------------------
0 : 1 <- start_idx, mid_idx
1 : 1 <- end_idx
...
```
When `target_mat_num = 0`, the next iteration of the while loop will assign `-1` to `end_idx` and thus `(0 + (-1)) >> 1 = -1` to `mid_idx`, causing an access error on line 703. The updated code maintains the invariant `start_idx <= end_idx` and will not go out of bounds.
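For intuition, here is a plain-Python analogue of a binary search that preserves the `start_idx <= end_idx` invariant (a sketch only, not the actual CUDA kernel):
```python
def search_first_geq(indices_1d, target_mat_num):
    # Returns the first position whose batch index is >= target_mat_num,
    # never letting end_idx drop below start_idx (so mid_idx cannot go negative).
    start_idx, end_idx = 0, len(indices_1d) - 1
    while start_idx < end_idx:
        mid_idx = (start_idx + end_idx) >> 1
        if indices_1d[mid_idx] < target_mat_num:
            start_idx = mid_idx + 1
        else:
            end_idx = mid_idx
    return start_idx

print(search_first_geq([1, 1], 0))  # 0 -- the failing case above no longer underflows
```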
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131977
Approved by: https://github.com/amjames, https://github.com/pearu, https://github.com/nikitaved
This patch adds logging for all frames Dynamo traced, during each invocation of a Dynamo-optimized function.
Example:
```python
import torch

@torch.compile
def foo():
    x = torch.ones([10])

    def bar():
        y = x + x
        torch._dynamo.graph_break()
        z = y * x
        return z

    return bar(), bar

foo()
foo()
```
Running `TORCH_LOGS="dynamo" python` on the above dumps the following near the very end.
```
......
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] starting from foo /Users/ryanguo99/Documents/work/scratch/test.py:4, torchdynamo attempted to trace the following frames: [
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] * foo /Users/ryanguo99/Documents/work/scratch/test.py:4
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] * bar /Users/ryanguo99/Documents/work/scratch/test.py:7
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] ]
I1003 12:18:31.064000 177 torch/_dynamo/eval_frame.py:486] starting from foo /Users/ryanguo99/Documents/work/scratch/test.py:4, torchdynamo attempted to trace the following frames: []
......
```
Fixes #118262.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137297
Approved by: https://github.com/williamwen42
Moves `TransformGetItemToIndex` to a file where dynamo stores other traceable HOP concepts. (We don't trace through torch.* modules by default)
Tracing through the mode required fixing a bug in dynamo's autograd function handling, which fixed a graph break, which in turn surfaced the autograd test failures (skipping those for now; will file an issue).
Previously those tests were in essence running in eager mode, because dynamo would fall back due to an arg mismatch error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137120
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115, #137116, #137117
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)
Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call
resume fn structure:
1. enter context
2. jump
...
3. exit context
The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).
So for torch function modes the structure of our output code is this:
1. graph call
2. mutate tf mode stack to replay mutations
4. unsupported code
5. on exception restore stack
6. resume function
Then our resume fn looks like this:
1. no-op enter torch function mode
2. jump
3. exit tf mode
To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).
Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally, at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly.
However, once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects, because we want to preserve the state of the mode stack as-is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode.
All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137114
Approved by: https://github.com/yanboliang
If there is an `unshard` (top-half) without a `wait_for_unshard` (bottom-half), then the next iteration's `unshard` will be a no-op. This can unexpectedly not propagate the optimizer update on the sharded parameters to the unsharded parameters, so it is better to clear that `unshard` at the end of backward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137348
Approved by: https://github.com/weifengpy
Summary:
- Further added more types for debug value dumping.
- Add a test case for symint inputs for the debug printer. In real prod model use cases, being able to examine the values of "unbacked symints" (those 'u0', 's0', etc.) can be helpful.
Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_sym_inputs_abi_compatible_cuda
```
Differential Revision: D63864708
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137323
Approved by: https://github.com/chenyang78
Summary: Implement Remote AOTAutogradCache. It uses all the same tech as Remote FXGraphCache, just with its own name.
Test Plan:
Run benchmark:
TORCHINDUCTOR_AUTOGRAD_REMOTE_CACHE=1 TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=1 TORCHINDUCTOR_AUTOGRAD_CACHE=0 TORCHINDUCTOR_FX_GRAPH_CACHE=0 TORCH_LOGS=+torch._functorch._aot_autograd.autograd_cache buck run mode/opt benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --repeat 5 --performance --cold-start-latency
See that it cache hits even with local cache removed.
Results show up in remote cache logs https://fburl.com/scuba/pt2_remote_cache/5893dbaj
New unit tests
Reviewed By: oulgen
Differential Revision: D63323958
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137278
Approved by: https://github.com/oulgen
Summary: Prototyping the custom op meta kernel generation. Rest of the changes are in fbcode/scripts/angelayi
Test Plan: followup diff (D63837739)
Differential Revision: D63837740
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137277
Approved by: https://github.com/zou3519
This PR fixes an issue where when running `python setup.py develop`, the `open_registration_extension` self contained example will not build due to the following:
```
error: 'synchronizeStream' overrides a member function but is not marked 'override'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137355
Approved by: https://github.com/albanD, https://github.com/spzala
When one process fails, others are immediately killed. This prevents other processes from doing necessary cleanups or dumping debug information (in particular, the NCCL flight recorder).
This PR adds a grace period. Default behavior is unchanged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131278
Approved by: https://github.com/albanD
Added an optimization pass to the swap function which removes extraneous pytrees. Currently it removes the pytree flatten/unflatten calls between modules in very specific scenarios (all the inputs of one module go into the other).
Future work can be to remove the input pytree.flatten if the inputs go directly into an unflatten, and output pytree unflatten if the outputs are directly from a pytree.flatten.
Differential Revision: [D62879820](https://our.internmc.facebook.com/intern/diff/D62879820)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136191
Approved by: https://github.com/avikchaudhuri
Currently FSDP2 supports only CUDA; for other backends that need to use FSDP2, it won't work since streams and events are CUDA-based. To support other backends, use
_get_device_handle by device type to get the device class and use this
for streams and events.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136843
Approved by: https://github.com/awgu
Summary: In unflatten, when we generate module calls for submodules whose signature has been preserved, we do not pass the original constant args. This can cause strange effects, e.g., if the module is swapped out with itself, we may suddenly go down a different path than the original, or even crash.
Test Plan: added a test
Reviewed By: angelayi
Differential Revision: D63913750
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137363
Approved by: https://github.com/angelayi
Fixes #136720
The result in this case says:
```
Traceback (most recent call last):
File "/Users/shenke/workspace/pytorch/mytest.py", line 9, in <module>
result = torch.bincount(input)
^^^^^^^^^^^^^^^^^^^^^
RuntimeError: maximum value of input overflowed, it should be < 9223372036854775807 but got 9223372036854775807
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136745
Approved by: https://github.com/Skylion007
Thanks @awgu for raising this issue and the small repro
From offline discussion with @albanD, in the case where a forward returns multiple outputs with different devices, we'd want to select the ready queue based on the device of the first one. Even though this is somewhat arbitrary, we prefer this over deciding which ready queue to push based on whichever input buffer's we happen to compute last, which can vary depending on more factors and thus be harder to reason about. This is in theory bc-breaking, but it seems unlikely that someone would depend on this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135633
Approved by: https://github.com/albanD
This change modifies the `hipify_python.py` script to properly detect all directories, `include` and `ignore` paths during hipification process on Windows, by changing the path syntax convention to a UNIX-like one.
Since in many places the script assumes a UNIX-like convention by using paths with forward slashes `/`, I decided to accommodate it by converting Windows paths to UNIX-like ones. Doing so keeps the number of changes to the file limited. Moreover, this early-on unification lets the rest of the code keep its battle-tested Linux-like behaviour.
Another option would be to use the `Path` object from `pathlib` to represent all paths in the script; however, that would impact a broader share of the code and would hence require a more meticulous evaluation to make sure the logic and edge cases stay unaltered.
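A minimal sketch of the early path normalization described above (the helper name is made up, not the exact code in `hipify_python.py`):
```python
import os

def to_unix_path(path: str) -> str:
    # Convert Windows separators to forward slashes so the rest of the script
    # can keep assuming UNIX-like paths.
    return path.replace(os.sep, "/") if os.sep != "/" else path

print(to_unix_path(r"third_party\hipify\include"))  # third_party/hipify/include on Windows
```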
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135360
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
Summary:
X-link: https://github.com/pytorch/executorch/pull/5720
For smaller models the overhead of profiling ops might be prohibitively large (distorting the inference execution time significantly) so we provide users an option to disable op profiling and essentially only profile the important events such as inference execution time.
To disable operator profiling users need to do:
```
etdump_gen.set_event_tracer_profiling_level(executorch::runtime::EventTracerProfilingLevel::kNoOperatorProfiling);
```
Test Plan: Added test case.
Differential Revision: D61883224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136838
Approved by: https://github.com/dbort
if the function is
```func(a, b, c) ```
and is called as
```func(a=1, b=.., c=..)```
before this change we would not iterate over a, b, and c, since those appear in kwargs. This diff fixes that issue.
This function is used in _inductor/ir.py to iterate over custom op arguments, and when a custom pass makes changes
and passes arguments as kwargs, we would not process them.
```
for info, arg in torch._library.utils.zip_schema(schema, args, kwargs):
handle_aliasing_and_mutation(info, arg)
```
Fix https://github.com/pytorch/pytorch/issues/137057
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137311
Approved by: https://github.com/zou3519
Fixes https://github.com/pytorch/pytorch/issues/136358
The bug here is that the Tensor object is actually 2 classes: `Tensor` from `_tensor.py` and `TensorBase` from c++.
Before this PR, they have the following gc methods:
Tensor:
- tp_clear subtype_clear
- tp_traverse THPVariable_subclass_traverse
- tp_dealloc THPVariable_subclass_dealloc
TensorBase:
- tp_clear THPVariable_clear
- tp_traverse THPFunction_traverse (fake function that is just an error)
- tp_dealloc object_dealloc
The problem is that when clear is called on the Tensor, subtype_clear is going to clear the things owned by the "Tensor" type, in particular, its `__dict__` attribute, before delegating to the TensorBase clear where we detect that resurrection needs to happen and skip it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137267
Approved by: https://github.com/ezyang, https://github.com/kshitij12345
This PR relaxes the even sharding requirement for the all-gather extensions.
The `fsdp_pre_all_gather` now expects signature:
```diff
 def fsdp_pre_all_gather(
     self,
     mesh: DeviceMesh,
+    outer_size: torch.Size,
+    outer_stride: Tuple[int, ...],
     module: nn.Module,
     mp_policy: MixedPrecisionPolicy,
 ) -> Tuple[Tuple[torch.Tensor, ...], Any]:
```
- Since no one is using this new signature yet, we should be safe to change it.
- Currently, the `outer_stride` will always be contiguous strides since FSDP2 only supports contiguous strides for now.
- For the uneven sharding case, the user is responsible for returning a padded sharded tensor from `fsdp_pre_all_gather`. This is risky territory because if the user does not do so, then this may manifest as a NCCL timeout, as only the ranks with padding will error out. However, I am not aware of any way around this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137005
Approved by: https://github.com/weifengpy
Summary: We had attribute assignment detection and handling of registered buffer assignments when using `aot_autograd`, but not when using just `make_fx`. Fixed.
Test Plan: expanded coverage of `test_state_tensors` to use `export` instead of `torch.export.export`
Differential Revision: D63802576
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137240
Approved by: https://github.com/tugsbayasgalan
Adds lowering for `aten.searchsorted`. This entails:
1. Adding support for multi-dimensional bucket tensors to `ops.bucketize`.
2. Adding support for striding to `ops.bucketize`.
3. Adding support for sorting tensors to `ops.bucketize`.
4. Adding a lowering for `aten.searchsorted.Tensor`.
5. Adding a basic decomposition for `aten.searchsorted.Scalar` that calls into the lowering for tensors.
6. Updating the meta-function for `aten.searchsorted` to properly check some of the sizing conditions.
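For reference, a small eager-mode example of the op being lowered (inputs chosen arbitrarily):
```python
import torch

sorted_seq = torch.tensor([1.0, 3.0, 5.0, 7.0])
values = torch.tensor([2.0, 6.0])
# Insertion points that keep the sequence sorted.
print(torch.searchsorted(sorted_seq, values))  # tensor([1, 3])
```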
Closes #135873
Differential Revision: [D63766514](https://our.internmc.facebook.com/intern/diff/D63766514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135701
Approved by: https://github.com/amjames, https://github.com/eellison, https://github.com/davidberard98
Summary:
as title
also change it in `prepare_pt2e()` docstring
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:quantization_pt2e_qat
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization
```
Differential Revision: D63345059
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137233
Approved by: https://github.com/tugsbayasgalan
Summary:
Special autotuning configs like `num_warps` and `num_stages` can be passed to the kernel as parameters. The `config.all_kwargs()` call [here](762a7d197c/python/triton/runtime/autotuner.py (L106)) in the Triton code includes those special configs (names and values) into the potential arguments to the kernel. [Here](762a7d197c/python/triton/runtime/jit.py (L613)) some of those may be included in the actual kernel arguments, given that their names are present among the kernel parameters.
This PR replicates this behavior in user-defined Triton kernel compilation in PT2. Resolves#136550.
Test Plan:
```
$ python test/inductor/test_triton_kernels.py -k test_triton_kernel_special_params
inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
.inductor [('fxgraph_cache_bypass', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('extern_calls', 1), ('possibly_missed_reinplacing_opportunities', 0), ('possibly_missed_reinplacing_bytes', 0)]
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
.inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_bypass', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('extern_calls', 1), ('benchmarking.TritonBenchmarker.triton_do_bench', 1), ('possibly_missed_reinplacing_opportunities', 0), ('possibly_missed_reinplacing_bytes', 0)]
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 6 tests in 6.283s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137236
Approved by: https://github.com/zou3519
Summary:
When we handle dynamic shapes markers like `Dim.AUTO, Dim.DYNAMIC`, we use dynamo decorators, attaching set attributes to the export input tensors, e.g. `x._dynamo_dynamic_indices = set()`.
I thought this was fine, since it's done all the time with torch.compile, but it breaks some PT2Inference tests, specifically because unpickling a set attribute isn't possible with the C++ torch::jit::pickle_load call.
We've agreed that the PT2Inference side will clone sample inputs & pickle the original inputs to be safe, but this still establishes a nice invariant that user-facing decorators are both ignored & cleaned out in the lifecycle of an export call.
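For context, a small illustration of the decorator state involved (assuming the `mark_dynamic` decorator; the attribute name is an internal detail quoted from above):
```python
import torch

x = torch.randn(4, 8)
torch._dynamo.mark_dynamic(x, 0)  # marks dim 0 as dynamic by attaching metadata to the tensor
print(getattr(x, "_dynamo_dynamic_indices", None))  # e.g. {0}; a set attribute like this is what broke pickle_load
```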
Test Plan: test_export
Differential Revision: D63773534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137230
Approved by: https://github.com/avikchaudhuri
When the stub file `nn/parallel/distributed.pyi` was removed (#88701), some types that existed are no longer available. This pull request adds them back.
Just for reference, these types are used in pytorch-lightning's LightningCLI. Command line interfaces are created automatically, and having type hints make them nicer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136835
Approved by: https://github.com/kwen2501
We didn't support multiple levels of vmap. The main problem is that, during
the batching rule, we need to exclude the vmap dispatch key
(FuncTorchBatched), like our C++ batching rules do.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137306
Approved by: https://github.com/Chillee
This PR adds new meta functions for `lerp`, `addcmul`, and `addcdiv` (including their
respective inplace versions).
These functions only had refs implementations, which was the root cause of a
significant overhead ([issue][1]) when running the `AdamW` optimizer step on the PyTorch/XLA
backend. Running the meta functions resulted in the following improvements:
- `lerp` calls: 1,550ms to 140ms (10x)
- `addcdiv` calls: 640ms to 350ms (1.8x)
- `addcmul` calls: 620ms to 300ms (2.05x)
[1]: https://github.com/pytorch/xla/issues/7923
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136909
Approved by: https://github.com/jansel
One-shot all-reduce did not have a barrier at the end. It was possible for a rank to write to its p2p buffer for the next collective before another rank finished reading it for the previous collective.
Also removing the fuse-input-copy optimization. The synchronization complexity probably outweighs the saving.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137257
Approved by: https://github.com/Chillee
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:
## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.
## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5) <-- no additional context yet
del work <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)
When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.
## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.
`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
This should be a no-op change, i.e. it runs the same code, but replaces verbose ObjectiveC invocation with helper function from OperationUtils.h, which this example already depends on
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137313
Approved by: https://github.com/atalman
Previously, all integer inputs to user-defined triton kernels were assumed to be int32. This would result in errors if your input was actually an int64.
This PR checks the value to determine which dtype to use for indexing: if it is known to be < int_max, then use int32 (and add guards if relevant); if we can't check (e.g. unbacked symint), then use int64.
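A rough sketch of the dtype-selection rule described above (names are illustrative, not the actual inductor helpers):
```python
INT32_MAX = 2**31 - 1

def index_dtype_for(value_hint, is_backed):
    # is_backed=False models an unbacked symint whose value cannot be checked or guarded on.
    if is_backed and value_hint <= INT32_MAX:
        return "tl.int32"  # a guard on value <= INT32_MAX would be installed in the real code
    return "tl.int64"

print(index_dtype_for(2**35, True))   # tl.int64
print(index_dtype_for(1024, False))   # tl.int64 (unbacked: play it safe)
print(index_dtype_for(1024, True))    # tl.int32
```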
Differential Revision: [D63797975](https://our.internmc.facebook.com/intern/diff/D63797975)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137234
Approved by: https://github.com/eellison
As the title says, when testing on an internal case, we found that we produce very similar error output when certain ranks do not join one collective. This is because we didn't put all ranks into `candidate_ranks`, so they didn't get wiped out from the entries and checked again.
Ideally for the given case, we should report this as an out-of-order case, because ranks 0 and 1 call all-to-all while all the remaining ranks call all-gather-base. But when we select entries to compare, we don't have a global view of the entries.
In the specific case, on ranks 0 and 1, there is a collective of PG 7 on entry 1130 with seq ID = 1130. However, on the other ranks, there is a collective of PG 0 on entry 1130 with seq ID = 2. It's hard to use the entry index to do the match, because if we later consider p2p, this assumption will collapse, so for now we still defer it to users or a later stage of the debugging stream to figure out. To make the message clearer, I also include both the seq ID and the record_id (aka, entry index) in the message. (That does not mean this is impossible to implement in the code; for example, we could let each record_id subtract the maximum p2p seq ID before it, but users will easily see the wrong order, so we don't think that logic is necessary now.)
P1626755348
Differential Revision: [D63815335](https://our.internmc.facebook.com/intern/diff/D63815335/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137256
Approved by: https://github.com/c-p-i-o
Summary: We had a report of crashes in parallel compile subprocesses linked to reading justknobs. See https://fburl.com/workplace/14a4mcbh internally. This is a known issue with justknobs. It looks like we don't have a lot of control over evaluating knobs. Some are read in inductor (`"pytorch/remote_cache:autotune_memcache_version`), but many are read by the triton compiler. According to this advice https://fburl.com/workplace/imx9lsx3, we can import thread_safe_fork, which installs some functionality to destroy some singletons before forking and re-enable them after. This approach works for the failing workload.
Test Plan: See D63719673 where the reporting user was kind enough to provide us with a local repro. Without the relevant import, we can reproduce the crash. With the import, the training runs successfully to completion.
Differential Revision: D63736829
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137155
Approved by: https://github.com/xmfan, https://github.com/eellison
When we populate unlifted graph module, we actually only "unlift" constant tensor inputs which is problematic because export de-duplicates aliasing constants. As a result, we only register one constant instead of two constants. This PR fixes that by querying ep.constants table instead of ep.graph_signature.lifted_tensor_constants.
Differential Revision: [D63743111](https://our.internmc.facebook.com/intern/diff/D63743111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137162
Approved by: https://github.com/pianpwk
Summary: This just adds a config option and JK for turning on remote AOTAutogradCache. It does not implement anything with the new options being passed in. That will come next diff.
This PR also changes the command for turning on the local AOTAutogradCache to be more consistent to that of FXGraphCache: TORCHINDUCTOR_AUTOGRAD_CACHE
Test Plan: Existing tests should pass and should build
Reviewed By: oulgen
Differential Revision: D63321965
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137011
Approved by: https://github.com/oulgen
This greatly reduces compile time; TorchBench models that were previously 50-100x slower (vs the cpp backend) are now ~20x slower. More work needs to be done on the Triton side, but smaller block sizes will still be helpful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136612
Approved by: https://github.com/desertfire
ghstack dependencies: #135342
Function `_get_pg_default_device` is being used outside of `distributed_c10d.py`.
A concern is that people may not be aware of what it actually does, due to bad naming of this function:
`Return the device to use with ``group`` for control flow usage (object collectives, barrier).`
The remediation is as follows:
- Added a deprecation warning to `_get_pg_default_device`;
- Added a private function `_get_object_coll_device` to undertake what it does;
- Added a `_device_capability` function for users who want to query the device support of a PG -- it returns a plain list, no more "default" choice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136790
Approved by: https://github.com/H-Huang
Previously we were making a fairly restrictive assumption when unflattening an exported program: for any submodule, we would assert that the graph of every call to that submodule must be the same. This assertion is load-bearing, i.e., if we simply remove the assertion then we can get incorrect results, as shown by the following example.
```
class N(torch.nn.Module):
    def forward(self, x, b):
        if b:
            return x + 1
        else:
            return x + 2

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.n = N()

    def forward(self, x):
        x0 = x + 3
        x1 = self.n(x0, True)
        x2 = x1 + 4
        x3 = self.n(x2, False)
        return x3 + 5

m = M()
inp = (torch.ones(1),)
print(m(*inp))  # tensor([16.])
ep = torch.export.export(m, inp)
print(ep.module()(*inp))  # tensor([16.])
unflattened = torch.export.unflatten(ep)
print(unflattened(*inp))  # tensor([15.])
```
However, this goes against the spirit of specializing graphs when exporting: we should *expect* that for every call to a submodule we *might* generate a different graph. The goal of this PR is to fix unflattening to handle multiple specialized graphs corresponding to multiple calls to the same submodule.
The idea is simple: for every call to a child module `foo`, we will create potentially different child modules `foo`, `foo@1`, `foo@2`, etc. and use those names as targets in `callmodule` instructions in the parent graph. An immediate consequence of this is that the list of fqns in an unflattened module may not be the same as an exported module. Note that all these variants share the same parameters / buffers, so that multiple calls to the same submodule can share state as expected.
However, as described so far this scheme may end up with needlessly too many submodules. Thus, between calls to the same submodule, if graphs are equal then we optimize away the extra submodules and reuse call names as much as possible. Moreover, when submodules are shared across fqns, we also try to de-duplicate graphs corresponding to their calls as much as possible. Note that no matter what, information about which submodule was called is still preserved, so that if a submodule has to be swapped with another, one can still find all calls to the former submodule and replace them with calls to the latter.
A note on the choice of naming scheme for call names: instead of generating "sibling" modules `foo@1`, `foo@2`, etc. for `foo`, we had considered generating "children" modules `foo._1`, `foo._2`, etc. of `foo`. However this can cause spurious cycles when de-duplicating graphs. E.g., suppose that `foo` is an alias for `bar._1` and `foo._1` is an alias for `bar`, then we must either introduce a cycle or drop the opportunity to optimize. Another idea would be to make `foo` a dummy module that contains `foo._0` corresponding to the first call, but this necessitates too many changes to existing tests and hurts the common case.
Differential Revision: D63642479
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137013
Approved by: https://github.com/pianpwk
* Upload_metrics function to upload to ossci-raw-job-status bucket instead of dynamo
* Moves all added metrics to a field called "info" so ingesting into database table with a strict schema is easier
* Removes the dynamo_key field since it is no longer needed
* Removes the concept of reserved metrics, since they cannot be overwritten by user added metrics anymore
* Moves s3 resource initialization behind a function so import is faster
---
Tested by emitting a metric during run_test and seeing that documents got added to s3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136799
Approved by: https://github.com/ZainRizvi
Fixes#129366
Since NJT has custom serialization logic, we need an NJT-specific fix to clear out cached sizes / strides PyCapsules. Eventually, we should switch NJT to use the default serialization logic, but this depends on #125622 being addressed.
This PR also makes serialization more complete by explicitly handling `lengths`, `ragged_idx`, and the `metadata_cache`, ensuring working operation for both contiguous and non-contiguous NJTs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137031
Approved by: https://github.com/soulitzer
ghstack dependencies: #137030
Summary:
Given an op, with a pair (output buffer, input buffer) from that op, we consider marking the output buffer for in-place reuse. However, if the parent of the input buffer and the current op are going to be fused, then we don't want to mark the output buffer for in-place reuse. This change checks that criterion, and skips the in-place marking if it holds.
Test Plan:
New unit test "layer_norm_should_not_inplace" runs LayerNorm and checks for no "in_out" pointers.
Fixes#120217
Here's a diagram of the issue:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137042
Approved by: https://github.com/eellison
Fixes#130154
This PR takes the strategy outlined in the above issue and clears out any cached sizes / strides PyCapsules before serialization. This affects the default subclass serialization logic.
The PyCapsule issue also affects `deepcopy`, so that's fixed here as well.
Note: I originally tried utilizing a context manager to remove / restore cached PyCapsules after serialization, but in practice the state returned from `_reduce_ex_internal()` references the actual `tensor.__dict__()`, so the problem persists once the cached values are restored. Instead, we have to be careful to remove the cached values in the right place so they're not re-cached when pulling out size / stride information for serialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137030
Approved by: https://github.com/albanD
This PR unblocks a unit test with a single Float8Linear module. It fixes the following error:
```
torch._foreach_copy_(foreach_copy_dsts, all_gather_inputs)
[rank0]:E0913 13:44:29.829000 2179476 torch/testing/_internal/common_distributed.py:671] RuntimeError: "foreach_tensor_copy" not implemented for 'Float8_e4m3fn'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135955
Approved by: https://github.com/vkuzo, https://github.com/eqy
Summary:
Tests in test_mkldnn_pattern_matcher.py can take too long to finish. Splitting them into smaller tests, using `parametrize`.
I guess this means this test file has some refactoring opportunities as well. Next time we could parametrize the add functions too.
Differential Revision: D63723925
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137153
Approved by: https://github.com/desertfire
For the reusable action checkout-pytorch, skips cleaning workspace when running from a container environment.
The motivation for this change is twofold:
* There is no need for cleanup when running in ephemeral containers, as any changes will be discarded when the docker container is terminated;
* In the specific case of GITHUB_WORKSPACE, to enable sharing it between multiple containers, it needs to be mounted with `-v`. This prevents the possibility of running `rm -r` and deleting this mount path;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137168
Approved by: https://github.com/huydhn
Previously if we had a graph like:
```
triton_kernel_wrapper_functional_proxy = triton_kernel_wrapper_functional(...)
getitem: "f32[3][1]cuda:0" = triton_kernel_wrapper_functional_proxy['out_ptr']
getitem_1: "f32[3][1]cuda:0" = triton_kernel_wrapper_functional_proxy['out2_ptr']
sigmoid: "f32[3][1]cuda:0" = torch.ops.aten.sigmoid.default(getitem_1)
mul: "f32[3][1]cuda:0" = torch.ops.aten.mul.Tensor(tangents_1, sigmoid)
```
The partitioner would assume that the `sigmoid()` could be fused into either its user (the pointwise mul), or its producer (the user triton kernel). This could lead to a bad partitioning:
(1) If the partitioner thinks we can fuse the sigmoid with its producer triton kernel, we would keep the sigmoid compute in the forward, and have to generate two separate kernels in the forward (user triton kernel, dedicated sigmoid kernel)
(2) if the partitioner puts the sigmoid in the backward instead, we could fuse it with an existing backward kernel (the mul with a tangent)
Reviewed By: embg
Differential Revision: D63551393
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136878
Approved by: https://github.com/zou3519
Summary:
# Context
Goal: Enable CK for Inductor in FBCode
We split this stack into three diffs to help with review & in case we need to revert anything.
# This Diff
* Gets us to have CK kernels as an option for GEMM autotuning in Inductor.
Reviewed By: zjing14
Differential Revision: D62662705
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136234
Approved by: https://github.com/tenpercent, https://github.com/chenyang78
The 'set_requires_grad' dict appears to always be full of "False" values,
and we always set requires_grad based on the value of 'has_backward'.
Setting the requires_grad field was being done repeatedly during
get_fwd_recv_ops, but it should be done just once, so move it to the
function that creates the recv buffers in the first place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136804
Approved by: https://github.com/kwen2501
Summary:
We skip save_gpu_kernel if the kernel has already been saved.
This would give us a more accurate Triton profiling result. The
following trace shows before/after the change for a benchmarking of a
trivial addmm:
Before:
<img width="1255" alt="Screenshot 2024-09-23 at 10 26 53 AM" src="https://github.com/user-attachments/assets/5aea05ef-6ef0-464c-8da9-17b31c97b43a">
After:
<img width="910" alt="Screenshot 2024-09-23 at 10 27 03 AM" src="https://github.com/user-attachments/assets/488b7d4f-268f-41cf-8553-cb16ceeae118">
We can see that before the change, the benchmarking includes two parts,
(1) The overhead of our triton_heuristic call, which includes the
save/get, and the (expensive) hash computation.
(2) The exact computation of Triton kernel.
We see that (1) accounts for >50% of the time, which makes kernel selection
during profiling choose aten kernels over Triton kernels.
Test Plan:
Existing OSS CI
python test/inductor/test_cuda_cpp_wrapper.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137073
Approved by: https://github.com/desertfire
For Traceable FSDP2, the most common use case is to have `fullgraph=False` for forward pass (to allow user-level graph breaks), and `fullgraph=True` for compiled autograd backward pass (required for queue_callback support).
With `torch._dynamo.compiled_autograd=True`, previously we were not able to set different `fullgraph` config values for the forward vs. backward pass, since `rebuild_ctx` just reuses the forward compile config as-is. This PR adds the `torch._dynamo.config.compiled_autograd_kwargs_override` config to allow forcing `fullgraph=True` for CA Dynamo tracing.
With this PR, we can remove standalone compiled autograd ctx manager usage in Traceable FSDP2 unit tests, and consolidate on using `torch._dynamo.compiled_autograd=True`.
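A hypothetical usage sketch combining the two knobs mentioned above (spellings are taken from this description; treat the exact config paths as assumptions):
```python
import torch

# Assumed config paths, per the description above:
torch._dynamo.config.compiled_autograd = True
torch._dynamo.config.compiled_autograd_kwargs_override = {"fullgraph": True}

@torch.compile(fullgraph=False)  # forward pass may graph-break
def train_step(model, x):
    loss = model(x).sum()
    loss.backward()              # backward is captured by compiled autograd with fullgraph=True forced
    return loss
```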
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_True`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136967
Approved by: https://github.com/xmfan
Fixes#127920
This commit addresses a build failure occurring with GCC 12 and above due to the -Werror=nonnull flag. The error manifests in the test_api target.
**Issue:**
When building with GCC 12+, the following error occurs:
```
error: argument 1 null where non-null expected [-Werror=nonnull]
431 | __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
| ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
This change ensures that:
1. The flag is only added for GCC 12 or higher
2. The flag is only added if it's supported by the compiler
3. The flag is added specifically to the test_api target, not globally
By disabling this specific error, we allow the build to proceed while maintaining other compiler warnings.
**Test Plan:**
- Verified successful build with GCC 12 and above
- Ensured no regression in builds with earlier GCC versions and other compilers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137092
Approved by: https://github.com/malfet
`json.dumps(float("inf"))` returns `Infinity`, which is technically invalid json
This is fine if you `json.load` it, but ClickHouse cannot handle it.
Solution here: cast inf and nan to string (which ClickHouse is able to cast back to float)
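A minimal sketch of that cast (the helper name is illustrative):
```python
import json
import math

def safe_number(x):
    # Non-finite floats become strings so the emitted JSON stays standards-compliant;
    # ClickHouse can cast the strings back to floats on ingestion.
    if isinstance(x, float) and not math.isfinite(x):
        return str(x)
    return x

print(json.dumps({"metric": safe_number(float("inf"))}))  # {"metric": "inf"}
print(json.dumps({"metric": safe_number(float("nan"))}))  # {"metric": "nan"}
```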
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136877
Approved by: https://github.com/huydhn
Summary:
# Why
We want this to run internally
# What
- fix python path issue on the test
- reenable the test
# Background
(copied from similar issue resolved earlier)
It appears that the parent process does not pass the entire path down to the child process. Namely, if there is some setup that makes the sys.path effectively look different than, say, PYTHONPATH or something like this, the child will not inherit this setup. To avoid needing to keep track of specific setups, we pass the effective `sys.path` from the parent to the child through the PYTHONPATH env variable
Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:kernel_benchmark
Differential Revision: D63498897
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136876
Approved by: https://github.com/henrylhtsang
By even further reducing precisions of imprecise FP16 ops, introducing a new BF16_LOW_PRECISION_OPS category, and marking BF16 tests as xfail for `divfloor_rounding`, `floor_divide` and `remainder`.
I guess the nature of the low-precision results is that MPSGraph, unlike the rest of PyTorch, does not do accumulation over fp32 for reduction operations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136987
Approved by: https://github.com/albanD
ghstack dependencies: #137070
Related to #107302.
When built and tested with NumPy 2 the following unit tests failed.
```
=========================================================== short test summary info ============================================================
FAILED [0.0026s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_complex128 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0024s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_complex64 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0025s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_float32 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0024s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_float64 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0016s] test/test_linalg.py::TestLinalgCPU::test_nuclear_norm_axes_small_brute_force_old_cpu - ValueError: Unable to avoid copy while creating an array as requested.
FAILED [0.0054s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_complex128 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0055s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_complex64 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0048s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_float32 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0054s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_float64 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
=========================================== 9 failed, 1051 passed, 118 skipped in 152.51s (0:02:32) ============================================
```
This PR fixes them. The test is now compatible with both NumPy 1 & 2.
Some more details:
1. The `np.linalg.solve` has changed its behavior. So I added an adapter function in the unit test to keep its behavior the same no matter whether it is NumPy 1 or NumPy 2.
2. The cause of the failure is that when passing a `torch.Tensor` to `np.linalg.qr`, the return type in NumPy 1 is `(np.ndarray, np.ndarray)`, while it is `(torch.Tensor, torch.Tensor)` in NumPy 2.
3. NumPy 2 does not allow `np.array(obj, copy=False)`; it is recommended to use `np.asarray(obj)` instead.
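A small illustration of point 3 (assuming a CPU tensor):
```python
import numpy as np
import torch

t = torch.arange(3)
# NumPy 2: np.array(t, copy=False) raises "Unable to avoid copy while creating an array as requested."
# The spelling that works on both NumPy 1 and NumPy 2:
a = np.asarray(t)
print(a)  # [0 1 2]
```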
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136800
Approved by: https://github.com/lezcano
# Motivation
This PR intends to make the device-specific Event classes inherit from the generic torch.Event. The benefit is providing a generic abstract class `torch.Event` for different devices, like `torch.Stream`. This makes it easier for Dynamo to capture the Event of different devices, like torch.cuda.Event and torch.xpu.Event.
A follow-up PR will remove the now-useless base classes `_StreamBase` and `_EventBase` to avoid multiple inheritance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134845
Approved by: https://github.com/albanD, https://github.com/EikanWang
Inductor mutates the AOT backward graph. A solution could be to copy the graph, but since we don't know whether compiled autograd is applied or not, it would be expensive to always clone it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136741
Approved by: https://github.com/jansel
ghstack dependencies: #135663
Summary: The problem is, when the GPU is not big enough, we omit the test cases in the test class. We expect the test to be skipped, but due to fbcode CI it can throw an error instead. This causes the test to be flaky.
Test Plan: ci
Differential Revision: D62037908
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137055
Approved by: https://github.com/masnesral
Fixes#136440
**Issue:**
When building PyTorch in debug mode on aarch64 architecture using GCC, we encounter relocation errors due to the R_AARCH64_CALL26 relocation limit. This occurs because debug builds with -O0 optimization generate larger code sizes, potentially exceeding the range limit for these relocations.
**Fix:**
Apply -Og optimization instead of -O0 for aarch64 GCC debug builds. This slightly reduces code size while maintaining debuggability, bringing function calls back within the range of R_AARCH64_CALL26 relocations.
The fix is implemented by conditionally setting compiler and linker flags in CMakeLists.txt:
- For aarch64 GCC debug builds: use -Og
- For all other debug builds: retain -O0
This change affects only debug builds on aarch64 with GCC, leaving other configurations unchanged.
**Testing:**
Verified that the build succeeds without relocation errors on aarch64 systems with GCC in debug mode. Ensured that debugging information is still available and useful for debugging purposes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136990
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Since our implementation currently assumes contiguous strides, let us add an explicit check and raise an error at construction time if the parameter is not contiguous.
We can try to support this in the future. Mainly, I want to first learn more about how DTensor support for non-contiguous memory formats works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137000
Approved by: https://github.com/weifengpy
## The problem.
[A commit from three weeks ago](82d00acfee) appears to have broken five tests but was not caught by CI.
[A later commit](https://github.com/pytorch/pytorch/commit/e05ea2b1797) which added a decomposition of `transpose_copy` added another broken test, also seemingly not detected, making six total (listed below).
They came to my attention when I updated some pending decomposition pull requests which passed CI, and started getting failures like [this](https://hud.pytorch.org/pr/134319) for a test unrelated to any of these pull requests, `TestCommonCPU.test_out__refs_transpose_copy_cpu_float32`
Running `python test/test_ops.py -k _copy` on `viable/strict` found failures for six `_refs` ops: `copysign`, `expand_copy`, `index_copy`, `t_copy`, `transpose_copy`, `view_copy`
## The solution
The original commit did actually cause breakage by slightly changing user-visible behavior (in a special case involving scalar tensors being copied between different devices).
This pull request fixes that breakage in a reasonable way, but I don't understand why this error didn't appear in CI until I made later changes in the same area.
## To reproduce
To reproduce the six cases in your own client:
```
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=5 python test/test_ops.py TestCommonCPU.test_out__refs_view_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=2 python test/test_ops.py TestCommonCPU.test_out__refs_t_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python test/test_ops.py TestCommonCPU.test_out__refs_index_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=7 python test/test_ops.py TestCommonCPU.test_out__refs_expand_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python test/test_ops.py TestCommonCPU.test_out__refs_copysign_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=4 python test/test_ops.py TestCommonCPU.test_out__refs_transpose_copy_cpu_float32
```
@amjames
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136653
Approved by: https://github.com/zou3519
This PR is for supporting calling `parallelize_module` from within a model definition, making the model a parallel one.
Calling `parallelize_module` is an alternative to maintaining a set of `ColumnWiseLinear`, `RowWiseLinear`, etc, while still being able to directly author a parallel model.
(The motivation for authoring a parallel model is that there may be other distributed operations which may not be easily captured by any module; see the forward function below. Put differently, the purpose is to exploit the expressiveness of DTensor -- we need to first create DTensors before calling ops on them. Having parallelized modules in the model is one way of creating DTensors.)
For example:
```
class FeedForward(nn.Module):
    def __init__(self, config: TransformerArgs) -> None:
        super().__init__()
        w1 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        w2 = nn.Linear(config.hidden_dim, config.dim, bias=False)
        w3 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        self.w1 = parallelize_module(w1, Colwise)
        self.w2 = parallelize_module(w2, Rowwise)
        self.w3 = parallelize_module(w3, Colwise)

    def forward(self, x: Tensor) -> Tensor:
        y: DTensor = self.w2(F.silu(self.w1(x)) * self.w3(x))
        # y is a DTensor with Partial placement; we can return it as is.
        return y
        # Or we can convert it to Replicate -- there is modeling flexibility here.
        return y.redistribute(Replicate())

with device_mesh:
    model = FeedForward(config)

# Now model is a model parallelized onto device_mesh
y = model(x)
```
The `device_mesh` actually used for `parallelize_module` would be retrieved from the ambient context.
Calling `parallelize_module` from within model hierarchy also saves the use of *FQNs* as in the out-of-model annotation case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134247
Approved by: https://github.com/tianyu-l
compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because:
(1) it consists of a single input + many weights that are used sequentially
(2) contains a mix of recompute vs non-recomputed ops (matmul + sin)
(3) it is relatively simple
from running locally:
```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760
Approved by: https://github.com/ezyang
ghstack dependencies: #136670, #136759
this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:
(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths
Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)
I ran locally, and got these numbers on the 4 paths:
```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759
Approved by: https://github.com/laithsakka
ghstack dependencies: #136670
Fixes https://github.com/pytorch/pytorch/issues/136640
Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1.
In principle, we should already have this information by the time we get to inductor, because our FakeTensor compute will already have branched/guarded appropriately on whether any ops performed broadcasting.
For example, if we have a tensor with a size value of `64//(2048//(s3*(s2//s3)))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard:
```
Eq(64//(2048//(s3*(s2//s3))), 1)
```
I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True.
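To make this concrete, here is a minimal sketch of such a guard scan; the helper name and the guard container are illustrative, not inductor's actual internals:
```python
# Minimal sketch (assumed names): return True if an existing guard of the form
# Eq(LHS, 1) pins this size expression to 1.
import sympy

def size_is_one_via_guards(size_expr: sympy.Expr, guard_exprs) -> bool:
    for guard in guard_exprs:
        if isinstance(guard, sympy.Eq) and guard.rhs == 1 and guard.lhs == size_expr:
            return True
    return False

s2, s3 = sympy.symbols("s2 s3")
expr = 64 // (2048 // (s3 * (s2 // s3)))
assert size_is_one_via_guards(expr, [sympy.Eq(expr, 1)])
```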
I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues:
(1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions
(2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though.
Checking the guards feels pretty simple and easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670
Approved by: https://github.com/ezyang
To see the payoff, look at test/dynamo/test_logging.py
The general idea is to refactor produce_guards into produce_guards_verbose which also returns verbose code parts, which have our annotations.
The rest of the logic is plumbing around SLocs to the places they need to be so we can print them. Guards are easy; value ranges and duck sizing take more care.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136917
Approved by: https://github.com/anijain2305
1. Example of a failing diff:
https://github.com/pytorch/pytorch/pull/136740
2. Test this by running:
python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv
Results:
```
WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results.
MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it
```
MISSING REGRESSION TEST does not fail, but it is logged.
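For reference, a rough sketch of the comparison rule implied by these messages (the function and threshold names are illustrative, not check_results.py itself):
```python
def classify(actual: float, expected: float, noise_margin_pct: float) -> str:
    """Classify a benchmark result against its expected value and noise margin."""
    delta_pct = (actual - expected) / expected * 100.0
    if delta_pct > noise_margin_pct:
        return "REGRESSION"   # worse than expected beyond the noise margin
    if delta_pct < -noise_margin_pct:
        return "WIN"          # better than expected; expected results should be updated
    return "PASS"

assert classify(90, 110, 1.0) == "WIN"           # 18.18% lower than expected
assert classify(200, 100, 10.0) == "REGRESSION"  # 100% higher than expected
```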
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136551
Approved by: https://github.com/ezyang
ghstack dependencies: #136383
* This fixes a major CMake/Bazel configuration bug where we were leaving CUTLASS performance on the table, especially with FlashAttention. It enables using MMA instructions on SM90+, which should close the gap between SDPA and the external FA2. Note these operations only affect H100 and newer GPUs. Thankfully, this seems to have been updated recently into being a noop on the CUTLASS side; still, it is better to set the CMake variable properly.
* Also enables the additional new shape kernels added in the recent CUTLASS 3.5.1+ update. This was the original motivation of the PR before I realized the basic MMA kernels were accidentally disabled since we didn't go through the submodule's CMake/Bazel files.
* Adds a bit of compile time and code size, but it is well worth it considering it speeds up our internal flash attention significantly on H100s.
* These kernels and settings will be needed for Flash Attention 3 whenever we add that too.
Fixes #133695
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133686
Approved by: https://github.com/ezyang
Summary: In D60803317, we added CompileContext (trace_id) information to Kineto traces using caching when a CompileContext exits. As pointed out by some users, this gives inaccurate IDs because we are not getting the context that is actually being looked up within eval_frame. For this reason, we decided to revert that change and go with an approach that involves getting the trace_id associated with a given CacheEntry. To do this, we add a trace_id to the GuardedCode so that it can be passed onto a CacheEntry. Then, we change the lookup function to return said trace_id alongside the code so that we can pass both into our eval function. Once we get to a Torch-Compiled Region, we can just append the context information to the name of the annotation, thus bypassing any need for kwargs.
Test Plan: Added more comprehensive unit test. Saw that all the trace_ids appeared within the graph.
Differential Revision: D63138786
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136460
Approved by: https://github.com/ezyang
This is to avoid cache confusion between normal vs pydebug vs nogil builds of cpp extensions, which can lead to catastrophic ABI issues.
It is rare today for people to run both normal and pydebug builds on the same machine, but we expect quite a few people to run normal and nogil builds on the same machine going forward.
This is tested locally by running each version alternatively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136890
Approved by: https://github.com/colesbury
In #136512, we fixed handling for tl.constexpr and dynamic shapes: if a symint is passed to tl.constexpr, you should specialize on it, because tl.constexpr implies needing to know the concrete value at compile time.
However, when using triton_op, capture_triton, or non-strict export, the regression remains (and #136512 might technically regress some specific export scenarios) - see [Richard's comment](https://github.com/pytorch/pytorch/pull/136512/files#r1775999871).
This PR fixes these scenarios by implementing the handling differently depending on whether we're expecting a SymNodeVariable or a SymInt (/SymBool/SymFloat).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136686
Approved by: https://github.com/zou3519
DeviceIndex is an int8_t, which cout interprets as a char; it then shows up as a control character in logs (e.g. ^A).
Explicitly cast it to int so that the numbers are printed out correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135405
Approved by: https://github.com/wconstab
Fixes #134714 (or attempts to; testing steps are below).
For posterity, how one can test:
1. make sure you have USE_PTHREADPOOL=1 or pull a packaged binary
2. run gdb --args python, with `r` to enter, `Ctrl-C` to pause, and `c` to get back into Python
3. import torch
4. torch.set_num_threads(1); make sure this does not trigger any additional threads getting created.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136793
Approved by: https://github.com/albanD
Summary:
This PR adds `torch.float8e4m3fn` support to cuSPARSELt and `to_sparse_semi_structured`.
This will let users run fp8 + 2:4 sparse matmuls on Hopper GPUs with
cuSPARSELt >= 0.6.2, via the `scaled_mm` API.
```
A = rand_sparse_semi_structured_mask(256, 128, dtype=torch.float16)
B = torch.rand(dense_input_shape, device=device).to(torch.float16).t()
A_fp8, A_scale = to_float8(A)
B_fp8, B_scale = to_float8(B)
dense_result = torch._scaled_mm(
A_fp8, B_fp8,
scale_a=A_scale, scale_b=B_scale,
out_dtype=out_dtype
)
A_fp8_sparse = to_sparse_semi_structured(A_fp8)
sparse_result = torch._scaled_mm(
A_fp8_sparse, B_fp8,
scale_a=A_scale, scale_b=B_scale,
out_dtype=out_dtype
)
```
Note that to keep this consistent with normal torch behavior, calling
`torch.mm(A_fp8_sparse, B_fp8)` will raise a NotImplementedError.
I also turned on cuSPARSELt by default and added CUSPARSELT_MAX_ID to the
backend to make the tests a bit cleaner
Test Plan:
```
python test/test_sparse_semi_structured -k scaled_mm
python test/test_sparse_semi_structured -k fp8
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136397
Approved by: https://github.com/drisspg
Summary: Previously, is_fbcode just checked whether the checkout was git or not. This is extremely error prone. Let's make it fool-proof.
Test Plan: unit tests
Reviewed By: masnesral
Differential Revision: D63545169
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136871
Approved by: https://github.com/masnesral
Before this change, an attempt to run something like
```
% python -c "import torch;dev,dt='mps',torch.int; print(torch.normal(mean=torch.arange(1., 11., device=dev, dtype=dt), std=torch.arange(10, 0, -1, device=dev, dtype=dt)))"
```
resulted in a hard error:
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %5 = "mps.multiply"(%2, %arg1) : (tensor<10xf32>, tensor<10xsi32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %5 = "mps.multiply"(%2, %arg1) : (tensor<10xf32>, tensor<10xsi32>) -> tensor<*xf32>
/AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:953: failed assertion `original module failed verification'
```
After the change, it raises a nice type error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136863
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754, #136755, #136821, #136822
Fixes https://github.com/pytorch/pytorch/issues/136494
Currently, CUDASymmetricMemory::rendezvous() initializes a multicast address if multicast support is present. However, if we believe multicast support is present but cuMulticastCreate still fails for some reason, we do not fallback gracefully.
- In addition to CUDART and driver version check, query CU_DEVICE_ATTRIBUTE_MULTICAST_SUPPORTED to determine multicast support for a rank/device.
- Before initializing multicast for a block, ensure all ranks/devices have multicast support.
- This is unlikely, but if cuMulticastCreate still fails on rank 0, print the corresponding driver error message as a warning, and gracefully skip multicast initialization for the block.
- Introduced an environment variable (TORCH_SYMM_MEM_DISABLE_MULTICAST) to allow users to explicitly disable multicast support as a workaround.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136577
Approved by: https://github.com/Chillee, https://github.com/eqy
- Also makes the scales and zero point dtypes reconcile with the meta impl as well as with other
quantized ops' representation of scales and zero points.
- Makes sure quantize_per_token's output_dtype is respected.
There are a few places where we need to reconcile on scale and zero point dtypes,
but that will come later. These fixes are mainly being done to enable quantized
kv cache through the ET stack.
Differential Revision: [D62301840](https://our.internmc.facebook.com/intern/diff/D62301840/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136807
Approved by: https://github.com/jerryzh168
Removing `_transform_shapes_for_default_dynamic` and `assume_static_by_default=False` as added in https://github.com/pytorch/pytorch/pull/133620.
This reverts back to `assume_static_by_default=True`, using dynamo decorators (e.g. `maybe_mark_dynamic`, `mark_static`) for handling Dim.AUTO & Dim.STATIC instead. This is easier to maintain, as it doesn't require reasoning about "inverting" the dynamic_shapes specs, and it also opens up usage of other decorators (`mark_dynamic`, `mark_unbacked`).
On the user side this change has no effect. Internally, dynamic behavior is determined only by the `dynamic_shapes` specs (ignoring user-side input decorators, following https://github.com/pytorch/pytorch/pull/135536), and this information is transferred for _DimHints via decorators so that Dynamo/non-strict can create symbolic_contexts accordingly, e.g. 7c6d543a5b/torch/_dynamo/variables/builder.py (L2646-L2666)
One caveat is we don't raise errors for dynamic decorators on the user side, since we don't know if they're from user markings, or from re-exporting with inputs we've previously marked.
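As a rough illustration, a minimal sketch of the decorator mapping, assuming the hints are applied to example inputs before tracing (the exact mapping inside export may differ):
```python
import torch

x = torch.randn(4, 8)
# Roughly what a Dim.AUTO hint on dim 0 conveys: prefer dynamic, allow specialization.
torch._dynamo.maybe_mark_dynamic(x, 0)
# Roughly what a Dim.STATIC hint on dim 1 conveys: keep the dimension static.
torch._dynamo.mark_static(x, 1)
```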
Differential Revision: D63358628
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136591
Approved by: https://github.com/avikchaudhuri
Summary:
This diff logs the time_taken_ns for the forward and backward graphs in AOTAutogradCache, saving it into the cache entry.
This information will be helpful later when I remotify the cache, and it is also just useful to have in tlparse and chromium events.
Test Plan: Run benchmark, see that the times are in the chromium events.
Reviewed By: aorenste
Differential Revision: D62590077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136529
Approved by: https://github.com/oulgen
I think this could help many teams, especially compile/export teams (/cc @ezyang), by letting end users/bug reporters quickly test a WIP PR when reporting a related bug.
This could quickly run in an official nightly Docker container or in a nightly venv/conda env.
Let me know what you think.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136535
Approved by: https://github.com/ezyang
This was a stupid cast error that caused MPSGraph to crash with the following exception
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %3 = "mps.multiply"(%2, %arg1) : (tensor<1x3x9x9xf16>, tensor<1xf32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %3 = "mps.multiply"(%2, %arg1) : (tensor<1x3x9x9xf16>, tensor<1xf32>) -> tensor<*xf32>
/AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:953: failed assertion `original module failed verification'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136822
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754, #136755, #136821
This is a retry of https://github.com/pytorch/pytorch/pull/136594, which is having trouble landing.
Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:
`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`
https://github.com/pytorch/pytorch/pull/135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?
Differential Revision: [D63540693](https://our.internmc.facebook.com/intern/diff/D63540693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136858
Approved by: https://github.com/atalman
Original Issue: https://github.com/pytorch/pytorch/issues/134644
We assume trace_tangents have the same memory_format as the corresponding inputs, outputs, and intermediates during the first tracing. As a result:
Tracing time:
- Store trace_tangents_memory_formats in metadata
- Coerce tangents to the deduced memory_format
Runtime:
- Coerce tangents to the tracing memory format stored in metadata
Subclasses logic:
- Previously, the tangent-coercing logic did not handle the nested subclasses case; this is fixed here.
For subclasses, we deduce the memory format for the subclass tensor first, then for each element of the subclass:
[subclass_tensor_memory_format, subclass_tensor_elem0_memory_format, ... ]
If a subclass element (one of the __tensor_flatten__()[0] tensors) is itself a subclass, then in its place we will have a nested list of the same structure.
The recursive traversal of the subclass tree is expensive, so we do memory format deduction and coercion at the same time to keep it to a single traversal. With this approach there is no regression compared with the previous logic, which also did one traversal (`coerce_tangent_and_suggest_memory_format` method).
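A minimal sketch of the per-tensor deduce-and-coerce step, assuming plain (non-subclass) tangents; the helper names are illustrative, not the actual AOTDispatcher code:
```python
import torch

def deduce_memory_format(t: torch.Tensor) -> torch.memory_format:
    # Treat a 4D channels-last tensor specially; everything else stays contiguous.
    if t.dim() == 4 and t.is_contiguous(memory_format=torch.channels_last):
        return torch.channels_last
    return torch.contiguous_format

def coerce_tangent(tangent: torch.Tensor, fmt: torch.memory_format) -> torch.Tensor:
    # Only copy when the runtime tangent does not already match the traced format.
    if tangent.is_contiguous(memory_format=fmt):
        return tangent
    return tangent.contiguous(memory_format=fmt)
```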
Other small change:
Removed a duplicated, unrelated comment.
Testing
```
python test/functorch/test_aotdispatch.py -k test_channels_last_grads_no_force_contiguous
```
Benchmarking:
After change:
```
└─ $ PYTORCH_AOTD_DEBUG_PROFILE=1 python test/functorch/test_aotdispatch.py -k test_benchmark_grads_no_force_contiguous
Benchmark SUBCLASS avg_bwd_duration:4.059906005859375 ms
Benchmark NO_SUBCLASS avg_bwd_duration:3.1563830375671387 ms
```
Before change:
```
BEFORE_CHANGE SUBCLASS 4.1194
```
No significant change in processing time.
(We do single traverse of subclass tree for collecting memory_formats and coercing during tracing.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135225
Approved by: https://github.com/bdhirsh
Fixes the max-autotune failure of `soft_actor_critic` in Torchbench in the FP32 single-thread dynamic-shapes case:
```log
File "/home/user/inductor/pytorch/torch/_inductor/codegen/cpp_micro_gemm.py", line 136, in codegen_call
C_ptr = f"&({kernel.index(C, [0, 0])})"
File "/home/user/inductor/pytorch/torch/_inductor/codegen/cpp_template_kernel.py", line 135, in index
else self.args.input(node.get_name())
File "/home/user/inductor/pytorch/torch/_inductor/codegen/common.py", line 1251, in input
assert name not in V.graph.removed_buffers, name
AssertionError: buf_GemmOut
```
The 1st and 2nd linears do not need to use a local buffer, while the 3rd linear does.
The 3rd linear, which uses a local buffer, adds its global buffer (named `buf_GemmOut`) into `V.graph.removed_buffers`.
When scheduling the nodes, the 1st linear (which won't use a local buffer) gets its output buffer (also named `buf_GemmOut`) from the input, finds that it is in `V.graph.removed_buffers`, and raises the AssertionError. The issue is that the output buffers of all these linears are named `buf_GemmOut`, which conflicts.
This PR renames these buffers by adding the name of the `template_buffer` as a prefix.
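A tiny sketch of the renaming scheme, with illustrative names:
```python
def gemm_output_buffer_name(template_buffer_name: str) -> str:
    # e.g. "buf0_GemmOut" and "buf2_GemmOut" instead of a shared "buf_GemmOut"
    return f"{template_buffer_name}_GemmOut"
```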
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136419
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #136418, #136518
Cleanup:
1/ We do not need to call unwrap_subclasses() in the freezing wrapper, as it will be wrapped by the AOTD wrappers, which include SubclassesWrapper.
2/ No need to use weak references for the unwrapped list; dynamo optimizers need to clean the unwrapped list along with the original params_flat.
Verified with fbcode compiled_optimizers tests.
Differential Revision: [D63393651](https://our.internmc.facebook.com/intern/diff/D63393651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136549
Approved by: https://github.com/bdhirsh
Fixes #135439
This PR adds support for the `is_inference` method on torch tensors which successfully compiles the following example fn without graph breaks:
```python
def fn_simple(x):
if x.is_inference():
return x.sum()
else:
return x.min()
```
I've also tried to add guards on the tensor to guard against `is_inference`. I wasn't 100% sure where these should go so please don't hesitate to correct me.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136450
Approved by: https://github.com/ezyang
Related issue: #125077
### Feature
Inductor tries to remove dimensions with stride 0 from block pointers. Rather than loading with stride 0, it's more efficient to load a smaller block pointer, then use `tl.broadcast_to` to broadcast it up to the desired size. This already worked for simpler block pointers, but it was disabled for more complex block pointers which used `tl.reshape` to change the dimensionality after loading.
This PR generalizes the approach to work for all block pointers. The idea is to first reshape, adding singleton dimensions, then broadcast those singletons up to something larger, then reshape again to the final output shape. For readability, we emit this code only if it actually does something. Simpler loads will just have `tl.load`.
Here's an example of a complicated kernel that uses `reshape` -> `load` -> `reshape`. (The first reshape is actually the slice `[None,None,:]`).
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 64
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x2 = xindex
x1 = (xindex // 8)
tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
tmp1 = tl.reshape(tl.broadcast_to(tl.load(tl.make_block_ptr(in_ptr1, shape=[8], strides=[8], block_shape=[((7 + XBLOCK) // 8)], order=[0], offsets=[(xoffset // 8)]), boundary_check=[0], eviction_policy='evict_last')[:, None, None], [((7 + XBLOCK) // 8), ((1) * ((1) <= (((7 + XBLOCK) // 8))) + (((7 + XBLOCK) // 8)) * ((((7 + XBLOCK) // 8)) < (1))), ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))]), [XBLOCK])
tmp2 = tmp0 + tmp1
tl.store(tl.make_block_ptr(out_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tmp2.to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```
Before this PR, we would have stride-0 dimensions:
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 64
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x2 = xindex
x1 = (xindex // 8)
tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
tmp1 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr1, shape=[8, 1, 8], strides=[8, 0, 0], block_shape=[((7 + XBLOCK) // 8), ((1) * ((1) <= (((7 + XBLOCK) // 8))) + (((7 + XBLOCK) // 8)) * ((((7 + XBLOCK) // 8)) < (1))), ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))], order=[2, 1, 0], offsets=[(xoffset // 8), 0, xoffset % 8]), boundary_check=[0], eviction_policy='evict_last'), [XBLOCK])
tmp2 = tmp0 + tmp1
tl.store(tl.make_block_ptr(out_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```
Here's a simpler example where we use 2D tiling. In this case we don't actually need the broadcast. The broadcast is implied via a slice adding a new singleton dimension. This code is not changed by this PR, but it's important to know that we don't accidentally insert unnecessary broadcasts.
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
ynumel = 8
xnumel = 8
yoffset = tl.program_id(1) * YBLOCK
yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
ymask = yindex < ynumel
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
xmask = xindex < xnumel
x1 = xindex
y0 = yindex
tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[8, 8], strides=[1, 8], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1])
tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[8], strides=[8], block_shape=[YBLOCK], order=[0], offsets=[yoffset]), boundary_check=[0], eviction_policy='evict_last')[None, :]
tmp2 = tmp0 + tmp1
tl.store(tl.make_block_ptr(out_ptr0, shape=[8, 8], strides=[1, 8], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), tmp2.to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
```
### Test Plan
Added a new expecttest to check the emitted code for broadcast addition. Looking at the test, we can see that stride 0 dimensions are removed. (This test generated the example kernels in the previous section.)
This change also removed a stride-0 dimension in an existing block pointer test. I updated the expected code accordingly.
Bonus: I noticed that the test parametrization for `config.prefer_nd_tiling` wasn't working as intended. It ended up always setting this option to `True`. Fixed it so we get the intended test coverage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135557
Approved by: https://github.com/shunting314, https://github.com/jansel
Co-authored-by: Yueming Hao <yhao@meta.com>
Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:
`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`
https://github.com/pytorch/pytorch/pull/135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?
Differential Revision: [D63465169](https://our.internmc.facebook.com/intern/diff/D63465169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136594
Approved by: https://github.com/mengluy0125, https://github.com/jansel
Summary:
We skip save_gpu_kernel if the kernel has already been saved.
This gives us a more accurate Triton profiling result. The following traces show before/after the change for a benchmark of a trivial addmm:
Before:
<img width="1255" alt="Screenshot 2024-09-23 at 10 26 53 AM" src="https://github.com/user-attachments/assets/5aea05ef-6ef0-464c-8da9-17b31c97b43a">
After:
<img width="910" alt="Screenshot 2024-09-23 at 10 27 03 AM" src="https://github.com/user-attachments/assets/488b7d4f-268f-41cf-8553-cb16ceeae118">
We can see that before the change, the benchmarking includes two parts:
(1) the overhead of our triton_heuristic call, which includes the save/get and the (expensive) hash computation, and
(2) the actual computation of the Triton kernel.
We see that (1) accounts for >50% of the time, which makes kernel selection during profiling often choose aten kernels over Triton kernels.
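A minimal sketch of the "save only once" idea, with illustrative names (not the actual triton heuristics code):
```python
_saved_gpu_kernels: set = set()

def maybe_save_gpu_kernel(kernel_key: str, save_fn) -> None:
    # Skip the expensive hash/save work if this kernel was already saved,
    # so benchmarking measures only the kernel itself.
    if kernel_key in _saved_gpu_kernels:
        return
    save_fn()
    _saved_gpu_kernels.add(kernel_key)
```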
Test Plan:
Existing OSS CI
[Redacted, Some internal model results in D63441430]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136389
Approved by: https://github.com/desertfire
Summary:
Reenable the `test_triton_wrapper.py` test again
# Why
We want this to run internally
# What
- fix python path issue on the test
- reenable the test
# Background
It appears that the parent process does not pass its entire path down to the child process. Namely, if there is some setup that makes the effective sys.path look different from, say, PYTHONPATH, the child will not inherit this setup. To avoid needing to keep track of specific setups, we pass the effective `sys.path` from the parent to the child through the PYTHONPATH env variable.
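A minimal sketch of forwarding the parent's effective sys.path to a child process via PYTHONPATH (the child command here is just for illustration):
```python
import os
import subprocess
import sys

env = dict(os.environ)
env["PYTHONPATH"] = os.pathsep.join(p for p in sys.path if p)
subprocess.run([sys.executable, "-c", "import sys; print(sys.path)"], env=env, check=True)
```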
Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:triton_wrapper
Differential Revision: D63438186
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136721
Approved by: https://github.com/henrylhtsang
## Motivation
The FSDP common code for FSDP UT execution is mostly written with the CUDA device in mind. However, other devices such as Intel Gaudi support most of the functionality. We are generalizing the base content so that the UTs can be used for non-CUDA device execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133209
Approved by: https://github.com/kwen2501
Move `get-job-id` steps before running the tests and copy-n-paste environment variables from `_mac-test.yml` added in https://github.com/pytorch/pytorch/pull/113099
Should fix the following warning during MPS test run:
```
/Users/ec2-user/runner/_work/pytorch/pytorch/tools/stats/upload_metrics.py:147: UserWarning: Not emitting metrics for td_test_failure_stats_v2. Missing job_id. Please set the JOB_ID environment variable to pass in this value.
warn(f"Not emitting metrics for {metric_name}. {e}")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136791
Approved by: https://github.com/albanD, https://github.com/izaitsevfb
PyTorch community members have reported issues with building PyTorch from source for ROCm in an environment that doesn't have aotriton pre-installed, because aotriton is only installed in the [CI](a8ed873ba2/.ci/docker/manywheel/Dockerfile (L197)) docker images. Building aotriton from source can take ~45 minutes.
This PR fixes the issue by downloading the aotriton tarball in such scenarios, *unless the user explicitly wants to build aotriton from source using the AOTRITON_INSTALL_FROM_SOURCE=1 env var*
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136603
Approved by: https://github.com/atalman
Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
Summary:
With empty graphs, the `graph.inserting_before(first_user_input)` call with `first_user_input = None` turns into a `graph.inserting_after(root)` call, inverting the order in which constant input nodes are inserted.
This fixes the issue by initializing to the first node in the graph (still valid if not a user input - only used for insertion).
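A minimal sketch of the fallback, assuming an `fx.Graph` named `graph`; the helper is illustrative, not the exact code in the PR:
```python
import torch.fx as fx

def constant_insertion_point(graph: fx.Graph):
    first_user_input = next((n for n in graph.nodes if n.op == "placeholder"), None)
    # Fall back to the first node of the graph when there is no placeholder;
    # inserting_before(None) would degenerate to inserting_after(root) and
    # invert the insertion order of the constant nodes.
    anchor = first_user_input if first_user_input is not None else next(iter(graph.nodes), None)
    return graph.inserting_before(anchor)
```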
Test Plan: test_export
Differential Revision: D63403514
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136658
Approved by: https://github.com/avikchaudhuri
This file hasn't had an overhaul in a few years, so this is long overdue. Most of the credit goes to @orionr for gathering all of this info.
The main rules we followed:
- No code contributor is removed, they're all placed as emeritus
- Break down categories that are too big, to make this document useful for knowing who to ping
- No category where the code is still in the codebase is removed
- We did not rework the categories (for example to be closer to module: labels) and leave that for later
- All non-emeritus names are ordered by their number of comments on issues related to their topic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136672
Approved by: https://github.com/eqy, https://github.com/ezyang, https://github.com/seemethere, https://github.com/malfet
Not sure why `isinf` is a composite op, but it needs to be implemented by hand.
The implementation is a trivial call to
```objc
[mpsGraph equalWithPrimaryTensor:input
secondaryTensor:[mpsGraph constantWithScalar:std::numeric_limits<T>::infinity()
dataType:input.dataType]]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136689
Approved by: https://github.com/Skylion007
Prior to this PR, calling `reshape()` under `inference_mode()` would throw a `NotImplementedError`. This is because `inference_mode()` disables autograd key dispatch, incidentally preventing the decomposition of reshape for NJT.
This PR fixes this by redispatching on the `CompositeImplicitAutogradNestedTensor` key whenever a composite implicit op is encountered in `NJT.__torch_dispatch__()`. This fixes reshape and any other composite implicit ops underneath `inference_mode()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134683
Approved by: https://github.com/soulitzer, https://github.com/albanD
ghstack dependencies: #136566
Fixes #136565
This PR makes the python fallback robust to the case where there are no active modes & no tensors with the Python key. In this case, simply redispatch with the Python key disabled.
This was found when trying to use reentrant dispatch for NJT to get decompositions under `inference_mode()` when the autograd key is disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136566
Approved by: https://github.com/bdhirsh
**Summary**
Optimize the WOQ int8 AMX performance by changing the int8 -> bf16 conversion.
Earlier, 16 int8 elements were being loaded at a time & converted to 16 BF16 elements.
With this change, 32 int8 elements will be loaded at a time, and converted to a cache-line of 32 BF16 elements more efficiently.
Performance before
```
AUTOTUNE _weight_int8pack_mm(4096x4096, 4096x4096, 4096)
cpp_packed_gemm_0 38.0439 ms 100.0%
_weight_int8pack_mm 50.2524 ms 75.7%
SingleProcess AUTOTUNE benchmarking takes 1.1087 seconds and 1.9791 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 11008x4096, 11008)
cpp_packed_gemm_4 78.2038 ms 100.0%
_weight_int8pack_mm 119.1962 ms 65.6%
SingleProcess AUTOTUNE benchmarking takes 1.9274 seconds and 1.9949 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x11008, 4096x11008, 4096)
cpp_packed_gemm_6 79.2368 ms 100.0%
_weight_int8pack_mm 118.3212 ms 67.0%
SingleProcess AUTOTUNE benchmarking takes 1.9200 seconds and 2.0015 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 32000x4096, 32000)
cpp_packed_gemm_224 225.7201 ms 100.0%
_weight_int8pack_mm 388.5588 ms 58.1%
```
Performance after this PR
```
AUTOTUNE _weight_int8pack_mm(4096x4096, 4096x4096, 4096)
cpp_packed_gemm_0 11.0086 ms 100.0%
_weight_int8pack_mm 50.2918 ms 21.9%
SingleProcess AUTOTUNE benchmarking takes 1.0837 seconds and 2.0301 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 11008x4096, 11008)
cpp_packed_gemm_4 24.3528 ms 100.0%
_weight_int8pack_mm 119.8492 ms 20.3%
SingleProcess AUTOTUNE benchmarking takes 1.8303 seconds and 1.8195 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x11008, 4096x11008, 4096)
cpp_packed_gemm_6 24.6148 ms 100.0%
_weight_int8pack_mm 119.1908 ms 20.7%
SingleProcess AUTOTUNE benchmarking takes 1.8315 seconds and 1.8352 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 32000x4096, 32000)
cpp_packed_gemm_224 78.1369 ms 100.0%
_weight_int8pack_mm 387.6289 ms 20.2%
SingleProcess AUTOTUNE benchmarking takes 4.5059 seconds and 1.8010 seconds precompiling
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136630
Approved by: https://github.com/jgong5
ghstack dependencies: #136353
Summary:
We have a user report on a BA model that raised "AttributeError: 'SymFloat' object has no attribute 'shape'", so we add a type check for the meta node.
See more context in the post
https://fb.workplace.com/groups/1075192433118967/permalink/1510477489590457/
Test Plan:
# local reproduce
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split-batch-decompose --flow_id 646303196
```
P1609807876
# E2E
before fix
f646303196
after fix
Differential Revision: D63399959
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136650
Approved by: https://github.com/ezyang
Fixes #133683, fixes #133684, fixes #133688
This PR introduces a new base class `_ArglessActivation` and refactors five existing activation functions to inherit from it. This change aims to improve documentation consistency and also API consistency with other activation functions that do have parameters and explicitly call `super().__init__()`
Key changes and considerations:
1. Added a new class, `_ArglessActivation` (see the sketch after this list).
2. Refactored the following classes to inherit from `_ArglessActivation`:
- Sigmoid
- Tanh
- Softsign
- Tanhshrink
- Softmax2d
3. Performance consideration:
- This change introduces a slight overhead for creating a new stack frame and handling an additional function call on every instance creation
- The impact is expected to be minimal in most use cases
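A minimal sketch of the refactor, assuming the forward implementations stay unchanged; only Softsign is shown for brevity:
```python
import torch
import torch.nn.functional as F
from torch import nn

class _ArglessActivation(nn.Module):
    """Base class for activation modules that take no constructor arguments."""

    def __init__(self) -> None:
        super().__init__()

class Softsign(_ArglessActivation):
    def forward(self, input: torch.Tensor) -> torch.Tensor:
        return F.softsign(input)
```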
Docs view before:
<img width="425" alt="Screen Shot 2024-09-18 at 3 00 22 PM" src="https://github.com/user-attachments/assets/ca0d1000-44c5-4c52-b344-68f7e170bafe">
Docs view after:
<img width="431" alt="Screen Shot 2024-09-18 at 3 00 52 PM" src="https://github.com/user-attachments/assets/f7ceb8f3-a2a2-4fd6-a2b8-39105a02bcbd">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136296
Approved by: https://github.com/mikaylagawarecki
Fixes https://github.com/pytorch/pytorch/issues/136177
The motivation is that torch::deploy doesn't handle this well. The
workaround for users is to use C++ custom ops.
All torch.library APIs ultimately go through the torch.library.Library
object, so we add checks to noop for torch::deploy there.
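A hedged sketch of the resulting behavior from the caller's perspective, assuming `torch._running_with_deploy()` is the relevant check; the wrapper below is illustrative, and the real guard lives inside `torch.library.Library` itself:
```python
import torch
import torch.library

def define_custom_op(namespace: str, schema: str):
    if torch._running_with_deploy():
        return None  # noop under torch::deploy; use C++ custom ops instead
    lib = torch.library.Library(namespace, "FRAGMENT")
    lib.define(schema)
    return lib
```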
Test Plan:
- new test
- going to test this internally and hope nothing breaks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136645
Approved by: https://github.com/ezyang
Fix two more leaks of the same variety as #136507 (see that PR desc and attached gdoc for debug details).
This time, also add a test-time check that helped to discover new leaks and ensure we won't accidently regress.
Adds `check_tensor_leak` util which internally asserts no tensors are being kept alive by other objects involved in py ref cycles.
Uses objgraph for a nice debug utility when a leak is found.
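A hedged sketch in the spirit of that checker (the real helper lives in torch.testing._internal.common_utils and also renders objgraph output):
```python
import gc
import torch

def check_tensor_leak() -> None:
    # DEBUG_SAVEALL keeps collected cyclic garbage around so it can be inspected.
    gc.set_debug(gc.DEBUG_SAVEALL)
    try:
        gc.collect()
        leaked = [o for o in gc.garbage if isinstance(o, torch.Tensor)]
        if leaked:
            raise AssertionError(
                f"{len(leaked)} tensors were found in the garbage. Did you introduce a reference cycle?"
            )
    finally:
        gc.set_debug(0)
        gc.garbage.clear()
```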
Credit to @H-Huang for pointing out objdump and helping debug the `param_group["intermediates"]` leak.
I manually confirmed that all 3 of the leaks identified/fixed so far are caught by the unit test and checker.
Sample output, if I re-introduce a leak by commenting out `del param_group["intermediates"]` in _backward.py,
and run `python test/distributed/pipelining/test_schedule_multiproc.py -k test_schedule_with_native_zero_bubble`:
```
warnings.warn(
/data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5341: UserWarning: 34 tensors were found in the garbage. Did you introduce a reference cycle?
warnings.warn(
/data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5347: UserWarning: Dumping first 1 objgraphs of leaked tensors rendered to png
Graph written to /tmp/objgraph-ztz642h3.dot (19 nodes)
Graph viewer (xdot) not found, generating a png instead
Image generated as /tmp/objgraph-ztz642h3.png
```
rendering of ` /tmp/objgraph-ztz642h3.png`:
<img width="1671" alt="image" src="https://github.com/user-attachments/assets/9098ff29-224c-4533-935b-83c210ac2e22">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136584
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
ghstack dependencies: #136507
Co-authored-by: Howard Huang <howardhuang@fb.com>
Fixes #131701
Use CMake imported targets more consistently to eliminate hardcoded paths.
Here are the new relevant sections of Caffe2Targets.cmake:
```
set_target_properties(c10_hip PROPERTIES
INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include"
INTERFACE_LINK_LIBRARIES "c10;hip::amdhip64"
)
```
```
set_target_properties(torch_hip PROPERTIES
INTERFACE_COMPILE_DEFINITIONS "USE_C10D_NCCL"
INTERFACE_COMPILE_OPTIONS "-fPIC;-D__HIP_PLATFORM_AMD__=1;-DCUDA_HAS_FP16=1;-DUSE_ROCM;-D__HIP_NO_HALF_OPERATORS__=1;-D__HIP_NO_HALF_CONVERSIONS__=1;-DTORCH_HIP_VERSION=602;-Wno-shift-count-negative;-Wno-shift-count-overflow;-Wno-duplicate-decl-specifier;-DCAFFE2_USE_MIOPEN;-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP;-std=c++17;-DHIPBLAS_V2;-DHIP_NEW_TYPE_ENUMS"
INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include"
INTERFACE_LINK_LIBRARIES "c10_hip;torch_cpu_library;hip::amdhip64;MIOpen;hiprtc::hiprtc;roc::hipblaslt;roc::hipblas;hip::hipfft;hip::hiprand;roc::hipsparse;roc::hipsolver"
)
```
The HIPCUB dependency was not actually used, which is why it is removed here; the imported target had undesirable side effects.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136283
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007, https://github.com/jithunnair-amd, https://github.com/atalman
This reverts commit 7743149b2be4a9eba7e0997ccdc6abe552bec266.
Reverts
* https://github.com/pytorch/pytorch/pull/135503
* https://github.com/pytorch/pytorch/pull/135502
* https://github.com/pytorch/pytorch/pull/135422
With this change, the following test passes. Earlier, the getitem would stay as a getitem in the FX graph, but now fake tensor propagation fails, saying that .item() is called. It seems that the torch function is not getting triggered during fake tensor propagation.
```
import torch
from torch.nn.attention.flex_attention import BlockMask, _mask_mod_signature, _score_mod_signature, flex_attention
from torch._inductor.lowering import make_pointwise, register_lowering
from torch._inductor.virtualized import ops
from torch.nn.attention.flex_attention import create_block_mask
torch.set_default_device('cuda')
flex_attention = torch.compile(flex_attention, dynamic=False)
prefix_lengths = torch.arange(8)
def prefix_lm(b, h, q, kv):
return prefix_lengths[b] >= kv
mask = create_block_mask(prefix_lm, 8, None, 512, 512, _compile=True)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136590
Approved by: https://github.com/Chillee
Summary: If you actually import the module, you might end up with some import cycle situation where a module is imported too early and accesses things that are not initialized yet.
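One common way to detect a module's presence without importing it, sketched below as an assumption about the approach rather than a description of this diff, is to consult the import machinery directly:
```python
import importlib.util

def module_exists(name: str) -> bool:
    # find_spec queries the import machinery without executing the module,
    # so it cannot trigger an import cycle (for top-level modules).
    return importlib.util.find_spec(name) is not None
```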
Test Plan:
sandcastle and ossci
```
TORCH_LOGS=+torch._inductor.codecache buck run mode/opt caffe2/benchmarks/dynamo:torchbench
```
Differential Revision: D63330224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136548
Approved by: https://github.com/Skylion007
Summary: Previously, the `_inline_module` helper function only worked with submodules that have args specified. This diff updates the util function to look for input arguments in the submodule kwargs first, using placeholder node names, and then fall back to the list of args if the node name is not found.
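A minimal sketch of the lookup order described above; the names are illustrative, not the actual helper's signature:
```python
def resolve_submodule_input(placeholder_name: str, position: int, args, kwargs):
    # Prefer kwargs keyed by the placeholder node name, then fall back to
    # positional args when the name is absent.
    if placeholder_name in kwargs:
        return kwargs[placeholder_name]
    return args[position]
```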
Test Plan:
```
buck2 run @//mode/{opt,mtia,inplace} //glow/fb/fx/fba/tests:test_fba_inductor -- -r test_connected_fusions
```
Differential Revision: D63347675
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136631
Approved by: https://github.com/jfix71
AMD devices have 64 elements per warp; this PR makes the handling of "ELEMENTS_PER_WARP_32" generic and uses DeviceProperties.warp_size to determine the warp size instead of hard-coding the warp size as 32. It also renames the enum value. Added a unit test for this.
Note: I left the old enum option (ELEMENTS_PER_WARP_32) as is instead of renaming it. I'm not sure whether we should expect caches to get invalidated here; if this concern is valid, then there's a risk that this would get updated, but some model could use the cached inductor code, which would reference "ELEMENTS_PER_WARP_32", which would no longer exist.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136472
Approved by: https://github.com/jansel
Summary: Title
Test Plan: CI
This fixes some breaking tests in ExecuTorch. I think the root cause is that when we have aten::matmul, which we are not preserving, we register the meta implementation from the C++ side. It seems like the C++ kernel doesn't work well with a mix of FakeTensor and real tensors. This PR sidesteps this problem by always preferring the Python CIA decomp over the C++ CIA decomp.
Differential Revision: D63297050
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136492
Approved by: https://github.com/bdhirsh
Summary: Previously we had a very bad bug where we didn't allow any decomp on CIA. This never mattered before because we never had to actually push a CIA decomp to the Python key level in export.
Test Plan: CI
Differential Revision: D63363749
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136600
Approved by: https://github.com/bdhirsh
Fixes #136504
If you have a tl.constexpr parameter to a triton kernel, and you pass in a SymNode, then, right now, you run into failures (see under 'constants'):
```
File "/tmp/torchinductor_dberard/na/cnax67r5zmslz7bvdfizteaepj7fajpjallb3bu2gyetjcdqtbzj.py", line 14, in <module>
triton_meta={'signature': {0: '*fp32', 1: '*fp32'}, 'device': DeviceProperties(type='cuda', index=0, cc=90, major=9, regs_per_multiprocessor=65536, max_threads_per_multi_processor=2048, multi_processor_count=132, warp_size=32), 'constants': {2: s0, 3: 256}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1), equal_to_1=())]},
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
NameError: name 's0' is not defined
```
To fix this, we specialize on the value during dynamo tracing, so that we have a real integer when we do codegen.
Alternatives: specialize somewhere else (e.g. inductor); or figure out how to actually pass the value dynamically into the user-written kernel. However, if we try to pass a dynamic value, then we wouldn't be able to precompile the triton kernels in inductor or use AOTI.
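A minimal sketch of the specialization idea (not dynamo's actual code path, which operates on SymNodeVariable objects):
```python
import torch

def specialize_constexpr_arg(arg):
    # int(SymInt) installs a guard on the current value and returns a plain int,
    # so the concrete value can be baked into the kernel's 'constants'.
    if isinstance(arg, torch.SymInt):
        return int(arg)
    return arg
```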
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136512
Approved by: https://github.com/oulgen, https://github.com/jansel, https://github.com/eellison
The test is failing in trunk atm with the following error:
```
test_serialization.py::TestSerialization::test_skip_data_serialization_materialize_fake_False - AssertionError: "Can't pickle local object 'WeakValueDictionary.__init__.<locals>.remove'" does not match "Can't get local object 'WeakValueDictionary.__init__.<locals>.remove'"
```
for example, 36f0e61166
This comes from this cpython commit a3076c734d, and manifests in python 3.12.5 currently used in CI. The failure doesn't happen when I try it out with 3.12.3 and 3.12.4. Looking at the commit logs of https://github.com/python/cpython/commits/main/Lib/pickle.py, it looks like the exception message is changing back and forth, so I guess a regex match would capture both.
Fixes the compilation error of max-autotune for `maml_omniglot` (AMP and FP32) and `soft_actor_critic` (AMP) in Torchbench for the single-thread dynamic-shapes case:
```
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp: In function ‘void kernel(const bfloat16*, const bfloat16*, const bfloat16*, bfloat16*, int64_t)’:
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp:279:41: error: the value of ‘Mr_blocks’ is not usable in a constant expression
279 | constexpr int64_t m_block_end = Mr_blocks;
| ^~~~~~~~~
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp:237:19: note: ‘Mr_blocks’ was not initialized with a constant expression
237 | const int64_t Mr_blocks = (M + Mr - 1) / Mr;
| ^~~~~~~~~
```
The PR also updates the UT to add a test for `BS`=512 in single thread.
The previous case has `BS`=1024, equal to the `K` and `N` values, so the generated code does not have symbolic shapes and thus fails to capture the above issue.
With a `BS`=512 case, the generated code has a symbolic shape for the M dim and is able to reproduce the issue that this PR addresses.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136418
Approved by: https://github.com/jgong5
https://github.com/pytorch/pytorch/pull/136087 updated pybind11 to 2.13.6, and that new release has a feature that is expressed by [a new function](https://pybind11.readthedocs.io/en/latest/changelog.html#version-2-13-6-september-13-2024), `_pybind11_conduit_v1_`. The presence of this function breaks the serialization mechanisms used by Triton and by PyTorch itself.
Possible errors that have been noticed due to this change:
<details>
<summary> the first error </summary>
```bash
_________ KernelTests.test_layout_constraint_needs_fixed_stride_order __________
Traceback (most recent call last):
File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1072, in test_layout_constraint_needs_fixed_stride_order
eager_out = f(x)
File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1068, in f
arange_out(x, y)
File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1059, in arange_out
kernel[grid](x, out, n_elements, BLOCK_SIZE=4)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in <lambda>
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/runtime/jit.py", line 657, in run
kernel = self.compile(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/compiler/compiler.py", line 315, in compile
metadata_group[metadata_filename] = fn_cache_manager.put(json.dumps(metadata, default=vars), metadata_filename,
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/__init__.py", line 234, in dumps
return cls(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
TypeError: vars() argument must have __dict__ attribute
```
</details>
<details>
<summary> the second error </summary>
```bash
________________ TestTritonWrapper.test_wrapper_using_gpu_seed _________________
Traceback (most recent call last):
File "/cache/pytorch-c5e9d03a2da4b93481737594cbe2f5931fa569aa833f206a638189cad2c36d3c-11/test/inductor/test_triton_wrapper.py", line 40, in test_wrapper_using_gpu_seed
out = f(x, y)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 465, in _fn
return fn(*args, **kwargs)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 1292, in __call__
return self._torchdynamo_orig_callable(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 1087, in __call__
result = self._inner_convert(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 530, in __call__
return _compile(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 933, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 675, in compile_inner
return _compile_inner(code, one_graph, hooks, transform)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_utils_internal.py", line 87, in wrapper_function
return function(*args, **kwargs)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 708, in _compile_inner
out_code = transform_code_object(code, transform)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/bytecode_transformation.py", line 1322, in transform_code_object
transformations(instructions, code_options)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 220, in _fn
return fn(*args, **kwargs)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 643, in transform
tracer.run()
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2776, in run
super().run()
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 979, in run
while self.step():
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 891, in step
self.dispatch_table[inst.opcode](self, inst)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2967, in RETURN_VALUE
self._return(inst)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2952, in _return
self.output.compile_subgraph(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1117, in compile_subgraph
self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1369, in compile_and_call_fx_graph
compiled_fn = self.call_user_compiler(gm)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1416, in call_user_compiler
return self._call_user_compiler(gm)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1465, in _call_user_compiler
raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1446, in _call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__
compiled_gm = compiler_fn(gm, example_inputs)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/__init__.py", line 2235, in __call__
return compile_fx(model_, inputs_, config_patches=self.config)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1528, in compile_fx
return aot_autograd(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/backends/common.py", line 72, in __call__
cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1071, in aot_module_simplified
compiled_fn = dispatch_and_compile()
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1056, in dispatch_and_compile
compiled_fn, _ = create_aot_dispatcher_function(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 522, in create_aot_dispatcher_function
return _create_aot_dispatcher_function(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 759, in _create_aot_dispatcher_function
compiled_fn, fw_metadata = compiler_fn(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 179, in aot_dispatch_base
compiled_fw = compiler(fw_module, updated_flat_args)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1357, in fw_compiler_base
return _fw_compiler_base(model, example_inputs, is_inference)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1428, in _fw_compiler_base
return inner_compile(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 479, in compile_fx_inner
return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/repro/after_aot.py", line 85, in debug_wrapper
inner_compiled_fn = compiler_fn(gm, example_inputs)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 665, in _compile_fx_inner
compiled_graph = FxGraphCache.load(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 1341, in load
compiled_graph = compile_fx_fn(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 574, in codegen_and_compile
compiled_graph = fx_codegen_and_compile(gm, example_inputs, **fx_kwargs)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 882, in fx_codegen_and_compile
compiled_fn = graph.compile_to_fn()
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1952, in compile_to_fn
return self.compile_to_module().call
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1878, in compile_to_module
return self._compile_to_module()
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1906, in _compile_to_module
mod = PyCodeCache.load_by_key_path(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2866, in load_by_key_path
mod = _reload_python_module(key, path)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/runtime/compile_tasks.py", line 45, in _reload_python_module
exec(code, mod.__dict__, mod.__dict__)
File "/tmp/tmps59zkbew/kg/ckgkb4gt5fs5pll4o7fqawppsmdezu5h52cq6nmrvi3yy6j7ddq4.py", line 45, in <module>
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/async_compile.py", line 198, in triton
kernel = TritonCodeCache.load(kernel_name, source_code)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2916, in load
return _module_to_triton_kernel(PyCodeCache.load(source_code), kernel_name)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2853, in load
return cls.load_by_key_path(key, path, linemap, attrs)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2866, in load_by_key_path
mod = _reload_python_module(key, path)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/runtime/compile_tasks.py", line 39, in _reload_python_module
raise RuntimeError(
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Failed to import /tmp/tmps59zkbew/g3/cg3zgxsidsjhdlz2lzvajvubdq6kg2x2hzd2kznfj43qwvlv33du.py
SyntaxError: invalid syntax (cg3zgxsidsjhdlz2lzvajvubdq6kg2x2hzd2kznfj43qwvlv33du.py, line 14)
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136280
Approved by: https://github.com/etaf, https://github.com/jansel, https://github.com/EikanWang
Co-authored-by: Henry Schreiner <HenrySchreinerIII@gmail.com>
Fixes the correctness issue of https://github.com/pytorch/ao/pull/884/. The current implementation for converting between `Half/BFloat16` and `int8/uint8` incorrectly assumes that 1/4 of the int8/uint8 vector lane maps to 1/2 of the Half/BFloat16 vector lane. This assumption leads to accuracy issues after the full bit-width vectorization of the Half data type was introduced. When converting between int8 weights and the half data type, the generated code is as follows:
```
#include "/tmp/torchinductor_leslie/xw/cxww3s7wxrujoyxna7mlcjktid2uu6nntixqwm542xfkd756gl3x.h"
extern "C" void kernel(const int8_t* in_ptr0,
half* out_ptr0)
{
{
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2048L); x0+=static_cast<int64_t>(32L))
{
auto tmp0 = at::vec::Vectorized<int8_t>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(32));
auto tmp1 = at::vec::convert<half>(tmp0);
tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(32));
}
}
}
```
In this PR, we address the issue by changing the implementation to convert 1/2 of the int8/uint8 vector lane into a full vector lane of Half/BFloat16.
**TestPlan**
* AO: `python test/integration/test_integration.py -k test_int8_weight_only_quant_subclass_api`
* `python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_convert_int8_to_half_vec`
* Due to the CPP backend legalization pass, we are unable to create a unit test to simulate the conversion from `Half` to `int8`. Instead, we rely on a C++ test case.
* `./build/bin/vec_test_all_types_AVX512 --gtest_filter="VecConvertTestsReducedFloat/*.ConvertReduced"`
* `./build/bin/vec_test_all_types_AVX2 --gtest_filter="VecConvertTestsReducedFloat/*.ConvertReduced"`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136353
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Some other tests seem to be holding onto memory that is not gc'able (e.g., cuBLAS workspaces), so these tests pass in isolation but fail when run alongside others, e.g. via `python test/test_cuda.py -k able`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136496
Approved by: https://github.com/ezyang
TL;DR: forward activation tensors were being kept alive "forever"
(or until GC ran); this was tracked down to a reference cycle involving
`stage_backward.<locals>.extract_tensors_with_grads`.
The reference cycle in question is below. (constructed using gc.get_referrers after doing a gc.collect in gc debug mode)
tensor is kept alive by
`[(<class 'cell'>, '0x7f7360234400')]`
tuple of cell objects
`(<cell at 0x7f73602343d0: function object at 0x7f734fff0ee0>, <cell at 0x7f7360234400: list object at 0x7f734e4d9a80>, <cell at 0x7f73602a4190: list object at 0x7f734eff8b00>)`
is kept alive by
`[(<class 'function'>, '0x7f734fff0ee0')]`
`<function stage_backward.<locals>.extract_tensors_with_grads at 0x7f734fff0ee0>`
is kept alive by
`[(<class 'cell'>, '0x7f73602343d0')]`
Put in plainer terms:
```
def stage_backward(...):
    ...
    stage_output_tensors = []
    # a cell object will exist that contains the variables defined in stage_backward
    # and used by both stage_backward and its nested functions;
    # in this case, the cell object contains 'stage_output_tensors'.
    # the nested function object below holds a reference to that cell, which captures
    # any vars from the parent scope not explicitly passed in as args.
    def extract_tensors_with_grads(...):
        ...
        # extract_tensors_with_grads refers to stage_output_tensors, so
        # stage_output_tensors is in the cell
        stage_output_tensors.append(output_val)
        ...
        # but extract_tensors_with_grads ALSO refers to itself, so
        # `extract_tensors_with_grads` will be in the cell as well
        extract_tensors_with_grads(...)
```
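For intuition, here is a minimal, self-contained sketch of this failure mode (illustrative names only, not the actual `stage_backward` code): once the nested function captures itself, whatever it appended to stays alive until the cycle collector runs.
```python
import gc
import weakref

class Activation:  # stand-in for a forward activation tensor
    pass

def make_cycle():
    stage_output_tensors = []

    def extract():            # closes over the list AND over its own name
        stage_output_tensors.append(Activation())
        extract               # self-reference puts `extract` itself in the cell

    extract()
    return weakref.ref(stage_output_tensors[0]), extract

ref, fn = make_cycle()
del fn                        # drop the last direct reference to the closure
print(ref() is not None)      # True: the cell cycle keeps the activation alive
gc.collect()                  # only the cycle collector can break the cycle
print(ref() is None)          # True
```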
More debug details:
https://docs.google.com/document/d/1QPH1Lz0tnieIFPM2tyHrjVB-bjlnHuDgjx1p2am3cmE/edit?usp=sharing
In pdb:
```
gc.collect()
g = gc.garbage
g[-1]
[rank0]:(Pdb) [rank0]:<function
stage_backward.<locals>.extract_tensors_with_grads at 0x7fee5c3392d0>
g[-2]
[rank0]:(Pdb) [rank0]:(<cell at 0x7fee7abbcf40: function object at
0x7fee5c3392d0>, <cell at 0x7fee7abbcf70: list object at
0x7fee7ab68940>, <cell at 0x7fee5c3210c0: list object at 0x7fee5e1
d6340>)
g[-3]
[rank0]:(Pdb) [rank0]:[tensor([[[-4.1127e-06, -3.3826e-06, 2.6226e-06,
..., 6.4969e-06,
[rank0]: -4.4405e-06, -4.7684e-06],
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136507
Approved by: https://github.com/awgu, https://github.com/kwen2501
Related: #132695
This PR uses padded dense <-> jagged conversions to handle binary pointwise broadcasting of (NT, T) and (T, NT). This includes:
* `(B, j0, D) + (1, 1, 1)`
* `(B, j0, D) + (B, 1, 1)`
* `(B, j0, D) + (B, 1, D)`
* etc.
This PR also adds (hacky) support for bool inputs to the jagged <-> padded dense conversions. The underlying CUDA kernels do not support integer / bool inputs; so the following workaround is employed: `convert input -> half, run conversion kernel, convert output -> bool`. Note that this bool support is needed specifically for the backward formula of `fmax`, and likely others.
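As a rough illustration of the broadcasting patterns listed above (shapes are arbitrary and not taken from the PR's tests):
```python
import torch

# (B, j0, D) jagged nested tensor: B=2 sequences of lengths 3 and 5, D=8
nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)
bias = torch.randn(2, 1, 8)  # dense (B, 1, D) operand

# the dense operand is broadcast across the ragged dimension via the
# padded dense <-> jagged conversion path described above
out = nt + bias
print(out.shape)  # the ragged dim shows up as a symbolic size
```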
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133021
Approved by: https://github.com/cpuhrsch
More or less literal copy-n-paste of c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L24)
and
c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L99)
Missing `uint8` implementation mimics CUDA behavior
Initial version coded live in https://www.youtube.com/watch?v=shi6Kb5xxvk
Later refinements:
- Switch from 2D dispatch to 1D one (to match CUDA behavior)
- Added batch + channel loops
- Fixed scale computation to match align corners behavior
- Added backward implementation
The backward implementation again mimics CUDA, so it has the same precision issues for `torch.half`, as well as a somewhat slow simulation of atomic adds using atomic compare-and-exchange on a pair of adjacent values, i.e.
```metal
template <typename T>
static inline void atomic_add_helper(
device atomic<int>* data,
long offset,
float value) {
auto ptr = data + (offset >> 1);
auto old = atomic_load_explicit(ptr, memory_order_relaxed);
union {
int i;
T t[2];
} val;
do {
val.i = old;
val.t[offset & 1] += static_cast<T>(value);
} while (!atomic_compare_exchange_weak_explicit(
ptr, &old, val.i, memory_order_relaxed, memory_order_relaxed));
}
```
Bumps the base Metal language version to 3.0, as it's supported on macOS 13 and is the first version that has `atomic_float`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136123
Approved by: https://github.com/albanD
Summary: Now that we have subprocess parallel compile on by default, we can change the internal compile_threads default to > 1 with a killswitch. There is some jankiness so we can avoid evaluating the justknob at import.
Test Plan: Ran codecache tests with JK on, then canaried locally with JK off
Differential Revision: D62913998
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136246
Approved by: https://github.com/eellison
- Set the new tolerances to ~N * eps(bfloat16), which should be a comfortable upper bound, where N is the inner dimension of the matmul.
Logic behind the choice of tolerance:
The maximum error of summing a series of N numbers in bfloat16 should be `N * epsilon(bfloat16)`. I confirmed by sampling different random seeds that the maximum observed error doesn't exceed this value and is usually much less.
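For concreteness, a tiny sketch of how such a bound can be computed (`N` here is a placeholder for the matmul's inner dimension):
```python
import torch

N = 1024  # placeholder inner dimension of the matmul
atol = N * torch.finfo(torch.bfloat16).eps  # bound on accumulated rounding error
print(atol)  # 8.0, since eps(bfloat16) = 2**-7 = 0.0078125
```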
Fixes test failures on Arm® Neoverse™ V1 (not raised as an issue, as this hardware type is not currently covered by the linux-aarch64 workflow).
```
Traceback (most recent call last):
File "/var/lib/jenkins/workspace/test/test_torch.py", line 2478, in test_cdist_large
self.assertEqual(expected, actual)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3885, in assertEqual
raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!
Mismatched elements: 134118 / 1000000 (13.4%)
Greatest absolute difference: 0.03829193115234375 at index (291, 726) (up to 0.005 allowed)
Greatest relative difference: 0.03519868478178978 at index (291, 726) (up to 1.3e-06 allowed)
```
@malfet @jondea
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136315
Approved by: https://github.com/albanD
Summary:
- Added TORCH_LOGS=cache to dump cache stats on exit - supported by RemoteCache.
- Split REMOTE_CACHE_VERSION - it was used for both JKs fx_graph_memcache_version and autotune_memcache_version but they really should be separate (just in case we need to change one but not the other)
- Prepare `_ManifoldCache` for use with other subpath keys
- Move create_cache to be more public and use it in codecache
- Add _InductorMetaTy alias (still just a dict)
- Cleaned up some common cached_autotune calls in triton_heuristics
Test Plan: unit tests
Reviewed By: oulgen
Differential Revision: D62648249
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136456
Approved by: https://github.com/oulgen
Original issue:
https://github.com/pytorch/ao/issues/890
The problem:
TracingContext.flat_params contains the original params, with subclasses not desugared,
while the inductor.freezing API works on AOT graphs, where subclasses are already desugared.
flat_params are used only for this logic, and storing the desugared subclasses in them fixes the issue.
Testing:
```
python test/functorch/test_aotdispatch.py -k test_inductor_freezing_with_subclasses
```
Torch AO original failure:
```
python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136265
Approved by: https://github.com/bdhirsh
## Description
Fixes the accuracy failure of FP32 `jx_nest_base` of max-autotune.
The current epilogue fusion implementation in the GEMM template assumes that the read of the template buffer and the write of the epilogue output in the epilogue node use the same index (the layouts may differ, but the index should be the same).
If this condition is not satisfied, the computation is wrong, leading to the correctness issue for FP32 `jx_nest_base`.
This PR disables epilogue fusion with the GEMM template when the above condition is not satisfied.
### Unsupported epilogue:
`buf1` is the template buffer and `buf2` is the epilogue output buffer.
The store of `buf2`:
401408 * d0 + 100352 * d1 + **7168 * d2** + **1792 * d3** + 128 * d4 + d5
The load of `buf1` in the epilogue node:
401408 * d0 + 100352 * d1 + **1792 * d2** + **25088 * d3** + 128 * d4 + d5
The above two indexes are different.
```
CppTemplateBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[25088, 128], stride=[128, 1]))
ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[8, 4, 14, 4, 14, 128], stride=[401408, 100352, 7168, 1792, 128, 1]), data=Pointwise(
'cpu',
torch.float32,
def inner_fn(index):
i0, i1, i2, i3, i4, i5 = index
tmp0 = ops.load(arg5_1, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
tmp1 = ops.load(buf0, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
tmp2 = tmp0 + tmp1
tmp3 = ops.load(buf1, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
tmp4 = tmp2 + tmp3
return tmp4
,
ranges=[8, 4, 14, 4, 14, 128],
origin_node=clone,
origins=OrderedSet([clone])
))
```
### Supported epilogue:
`buf1` is the template buffer and `buf2` is the epilogue output buffer.
The store of `buf2`:
d0 + 576 * d1 + 32 * d2
The load of `buf1` in the epilogue node:
d0 + 576 * d1 + 32 * d2
The above two indexes are the same.
The layout of `buf2` and `buf1` are different though which is handled by the reindexer:
`buf1`: `size=[324, 32], stride=[32, 1]`
`buf2`: `size=[1, 32, 18, 18], stride=[10368, 1, 576, 32]`
```
CppTemplateBuffer(name='buf1', layout=FixedLayout('cpu', torch.bfloat16, size=[324, 32], stride=[32, 1]))
ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.bfloat16, size=[1, 32, 18, 18], stride=[10368, 1, 576, 32]), data=Pointwise(
'cpu',
torch.bfloat16,
def inner_fn(index):
_, i1, i2, i3 = index
tmp0 = ops.load(buf1, i1 + 32 * i3 + 576 * i2)
tmp1 = ops.to_dtype(tmp0, torch.float32, src_dtype=torch.bfloat16)
tmp2 = ops.load(_frozen_param4, i1)
tmp3 = tmp1 * tmp2
tmp4 = ops.load(arg7_1, i1 + 32 * i3 + 576 * i2)
tmp5 = tmp3 + tmp4
tmp6 = ops.to_dtype(tmp5, torch.bfloat16, src_dtype=torch.float32)
return tmp6
,
ranges=[1, 32, 18, 18],
origin_node=convert_element_type_4,
origins=OrderedSet([add, mul, convert_element_type_4])
))
```
## TODO
Add the support for fusions when the indexes are different in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135661
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
At the moment, this lowers torch._scaled_mm with tensorwise scaling and with rowwise scaling for both A and B.
We probably also want to support mixed combinations of tensorwise and rowwise scaling for A and B, as well as bias support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136337
Approved by: https://github.com/chenyang78
Summary:
Add a customizable loss function callback to NodeAccuracySummary to
allow users to pass in their own loss function.
Also, fix some type errors and propagate better exception messages when
unexpected tensor comparisons occur. Finally, enhance the robustness of
`generate_numeric_debug_handle` in the case where it is called multiple
times on the same model, by avoiding reuse of the same IDs.
Test Plan: Added a test for this case in `test_numeric_debugger`.
Reviewed By: jerryzh168
Differential Revision: D62898297
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136282
Approved by: https://github.com/jerryzh168
This PR changes how `reconstruct` is done for a ConstDict. As of today, it works as follows:
(1) codegen(...) each pair of key/value
(2) create a new dictionary to hold the new items
(3) clear the original dictionary
(4) update the original dict with the one created in (2)
We do a micro optimization in the generated bytecode to:
- Only codegen the items that changed.
- Only clear the original dictionary if a key was removed.
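A rough Python-level sketch of the semantics (the actual change emits bytecode; the function and argument names below are purely illustrative):
```python
def reconstruct_old(original, codegen_items):
    rebuilt = dict(codegen_items)   # (2) build a fresh dict from all codegen'd pairs
    original.clear()                # (3) clear the original dict
    original.update(rebuilt)        # (4) copy everything back

def reconstruct_new(original, changed_items, any_key_removed):
    if any_key_removed:             # clear only if a key was actually removed
        original.clear()
    original.update(changed_items)  # re-emit only the items that changed
```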
Fixes: #133487
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134876
Approved by: https://github.com/zou3519
1. We want to take option 3 as discussed in https://github.com/pytorch/pytorch/issues/135712, so every time we retry, we create a new TCPStore server first, so that we don't need to append the attempt count as a prefix and we avoid eventual TCPStore sync failures. (This is only for the TCPStore-sharing-enabled case.)
2. We start a new server bound to an ephemeral port (i.e. 0) so it gets assigned a free port. We then pass that downstream (trainer or c10d). By doing so, the TCPStore is managed by the elastic agent rather than having a race condition on binding to a specific port in the trainer (see the sketch below).
3. The port is then broadcast for dynamic_rendezvous.
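A minimal sketch of the ephemeral-port idea (a standalone illustration; the agent-side wiring and broadcast are more involved):
```python
from torch.distributed import TCPStore

# bind the server to port 0 so the OS assigns a free port; the elastic agent can
# then broadcast the assigned port to trainers instead of racing for a fixed one
server = TCPStore("localhost", 0, is_master=True, wait_for_workers=False)
print(server.port)  # the port the server actually ended up listening on
```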
Only one more question: what do we do about the store created from `_create_tcp_store` in torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py? Are we OK with creating a duplicate TCPStore server?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135957
Approved by: https://github.com/d4l3k, https://github.com/c-p-i-o
Fixes #93843
`EmbeddingBag()` / `embedding_bag()` support 1D inputs with offsets to handle raggedness. NJT is a natural fit here as it already maintains offsets of the same form. This PR updates the python-side to support NJT and adds corresponding OpInfo-based NJT tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135888
Approved by: https://github.com/cpuhrsch
Summary:
After the previous refactor, we can now call load_with_key directly from AOTAutogradCache to use the remote FXGraphCache.
This does *not* implement a remote AOTAutogradCache. It just allows AOTAutogradCache to work with remote FXGraphCache.
Test Plan: (Meta only tests)
Reviewed By: aorenste
Differential Revision: D62384944
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136173
Approved by: https://github.com/oulgen
- Sometimes having access to the `MixedPrecisionPolicy` in the `fsdp_pre_all_gather` is useful. See [here](https://github.com/pytorch/ao/pull/748/files#r1760375325) in the torchao INT8 mixed precision training PR.
- Sometimes having access to the owning `nn.Module` allows for using it for saving state. See [here](https://github.com/pytorch/pytorch/issues/114299#issuecomment-2298692762) for an example.
The major pain point here is how to deal with backward compatibility. For now, we use `inspect.signature` to check whether the user subclass follows the old vs. new signature. However, for the new signature, the `param_dtype` in the post-all-gather is redundant: if the user needs it, they can save it from the `mp_policy` passed into the pre-all-gather.
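A hedged sketch of what such a compatibility check can look like (parameter counts and names are assumptions for illustration, not the exact extension-point signature):
```python
import inspect

def call_fsdp_pre_all_gather(param, mesh, module, mp_policy):
    hook = param.fsdp_pre_all_gather
    nparams = len(inspect.signature(hook).parameters)
    if nparams == 1:                      # old-style subclass: hook(mesh)
        return hook(mesh)
    return hook(mesh, module, mp_policy)  # new-style subclass gets the extra context
```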
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136129
Approved by: https://github.com/weifengpy
Summary: Create another wrapper codegen class to handle ArrayRef for CPU. The goal is to simplify the regular cpp wrapper codegen logic and the generated cpp code.
Test Plan: CI
Differential Revision: D62961885
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136318
Approved by: https://github.com/frank-wei
Improves and enables a commented-out test originally introduced in #131912.
In `test_custom_tag_metadata_re_export()`, we check that the "custom" metadata added to given nodes is preserved and not copied to other nodes after re-exporting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136048
Approved by: https://github.com/zhxchen17
**Summary**
Fix a circular import in `torch/distributed/utils.py` found when running an internal test, see D62901023. Curious why this wasn't causing any issues. Is the relevant code deprecated and no longer used?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136286
Approved by: https://github.com/Skylion007
Fixes #131337
- add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`.
- do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead.
- add workspace allocation generation code to `kernel_autotune_calls`. e.g.
```python
workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8)
workspace.zero_()
.....
triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0)
del buf2, arg0_1, arg1_1, workspace
```
- add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code.
The generated cpp has lines like the ones below, so we also implement a `zero_()` for `AtenTensorHandle`.
```cpp
static constexpr int64_t int_array_0[] = {1280L, };
static constexpr int64_t int_array_1[] = {1L, };
AtenTensorHandle workspace_handle;
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda, 0, &workspace_handle));
RAIIAtenTensorHandle workspace(workspace_handle);
workspace.zero_();
```
- Fix grid_fn handling for grid computation: pass "RBLOCK" to `split_scan_grid`.
- Fix dynamic shapes:
Without the fix we generate code that looks like this `workspace = empty_strided_cuda((32*((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune and `s0` is not defined.
The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code.
- We also generate slightly different cpp code depending on if `abi_compatible` is turned on.
```cpp
RAIIAtenTensorHandle workspace(workspace_handle);
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get()));
```
vs
```cpp
at::Tensor workspace = at::detail::empty_strided_cuda({8L*(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA);
workspace.zero_();
```
Test Plan:
```
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper
TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552
Approved by: https://github.com/desertfire
Summary:
- Clean up cache test code a bit.
- Removed patch_fbcode() - it turned out to cause flaky issues (imagine if it set fbcode=False and then loaded, for the first time, a module which had a top-level fbcode check).
Test Plan: unit tests
Reviewed By: oulgen
Differential Revision: D62648248
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136215
Approved by: https://github.com/bobrenjc93
**Motivations**:
A topological order of the scheduler nodes that optimizes the liveness of buffers can reduce peak memory utilization. This has been observed and studied e.g. [here](https://arxiv.org/pdf/1910.02653) and [here](https://proceedings.mlr.press/v202/steiner23a/steiner23a.pdf).
**Solutions**:
1. implement a peak memory estimator via liveness analysis
2. implement a few memory aware topological sorting algorithms and pick the one with the lowest peak memory
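A toy version of the liveness-based estimator in (1), as a sketch with made-up sizes rather than real Inductor buffers: each node is reduced to (bytes it allocates, sizes of buffers whose last use is that node).
```python
def estimate_peak_memory(schedule):
    peak = live = 0
    for alloc_bytes, freed_bytes in schedule:
        live += alloc_bytes          # the node's output becomes live
        peak = max(peak, live)       # peak is measured before anything is freed
        live -= sum(freed_bytes)     # buffers at their last use die here
    return peak

# same graph, two topological orders: consuming A right away lets it die earlier
order_late_free  = [(4, []), (4, []), (2, [4]), (1, [2, 4])]
order_early_free = [(4, []), (2, [4]), (4, []), (1, [2, 4])]
print(estimate_peak_memory(order_late_free))   # 10
print(estimate_peak_memory(order_early_free))  # 7
```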
**Results**:
On some models we can reduce the peak memory significantly:
| model | batch size | peak_memory baseline | peak_memory new | ratio |
|:-----------------------------:|:----------:|:--------------------:|:---------------:|:-----:|
| alexnet | 128 | 1.17 | 0.99 | 1.19 |
| vgg16 | 64 | 4.10 | 3.57 | 1.15 |
| DebertaV2ForQuestionAnswering | 1 | 11.60 | 10.56 | 1.10 |
In the presence of compiler based AC, peak memory can be further reduced:
| model | batch size | peak_memory baseline | peak_memory new | ratio |
|:------------------------------:|:----------:|:--------------------:|:---------------:|:-----:|
| AlbertForMaskedLM | 4 | 6.87 | 6.43 | 1.07 |
| AlbertForQuestionAnswering | 4 | 8.69 | 7.76 | 1.12 |
| MobileBertForQuestionAnswering | 128 | 4.67 | 3.90 | 1.20 |
[Here](https://fb.workplace.com/groups/1075192433118967/posts/1499920537312819/?comment_id=1499938843977655&reply_comment_id=1499951630643043) is an internal use case.
**Other infos:**
* neutral model runtime, because the reordering happens after fusion, so the memory saving is _for free_.
* minimal compile time overhead, as the algorithm is linear in the number of edges of the inductor graph. For all huggingface benchmark models, the additional compile time is less than 1 second.
* no peak memory regression, since we only adopt a new order if the peak memory is reduced according to the estimator. The estimator is unaware of operators' working memory, but for large models the working memory should be negligible. We haven't observed any significant regressions in our tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134874
Approved by: https://github.com/yf225
Fixes #134848
For BF16/FP16, when a tensor is passed as the `out` parameter of mean, the mean kernel should use its storage for the output. That doesn't happen today: an `at::to` in the current code allocates storage again, so the `out` tensor's storage doesn't get updated and ends up not holding the mean output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135174
Approved by: https://github.com/soulitzer
Avoid allocating memory or dry-running the submodule during stage init.
Save user-provided input/output metadata during stage init, to allow
lazily initializing the buffers before the first step call.
Later, we plan to build on top of this to add lazy shape inference
(#130856) so that no input/output shapes are required at stage init.
For now, we require input/output tensors for stage init, but these
should be on meta device and stage should not allocate any real memory.
Note: this needs more thorough testing and review, but it worked on the
torchtitan 3d test.
TODO:
- delete the 'device' arg from the PipelineStage ctor? (infer it from the args tensors passed to the first step call? separate PR)
- delete 'output_args' from the PipelineStage ctor? We don't actually need it, but we use it to do shape validation, which is why I didn't remove it in this PR. Proposal: leave it until we add lazy shape inference?
Fixes #136225, #136226
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136243
Approved by: https://github.com/H-Huang, https://github.com/kwen2501
Summary: The internal profiler behaves differently after turning on triton.autotune_at_compile_time. This needs more investigation, but turn it off for this test for now.
Reviewed By: henrylhtsang
Differential Revision: D63035855
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136356
Approved by: https://github.com/henrylhtsang
Summary:
Return from functions instead of using `skipTest`.
This is mostly to make our test report happier.
Skipped tests still show up in our Broken test report.
```
OK (skipped=1)
I0917 16:14:24.749060 1018907 StorageDemandControl.cpp:572] Flushing Demand Control ODS counters
Skipped: Store doesn't support extended APIs
```
Test Plan:
Tested locally.
Test shows up as passed instead of skipped.
```
Cache hits: 99%. Commands: 125048 (cached: 124961, remote: 10, local: 77)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D62912065
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136244
Approved by: https://github.com/XilunWu
Fixes https://github.com/pytorch/pytorch/issues/132331
We need another barrier here to ensure that the main thread doesn't stop the profiler while other threads are still using it (and crash). I can reliably reproduce the issue with `pytest -v test/profiler/test_cpp_thread.py -k test_profile_memory --flake-finder`.
### Testing
`pytest -v test/profiler/test_cpp_thread.py --flake-finder` all passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136304
Approved by: https://github.com/briancoutinho
Summary: The change involves passing the expired timers to the log_debug_info_for_expired_timers function after to_json() has been applied. This change is made to provide a better debugging experience for the user.
Test Plan: unit tests
Reviewed By: gag1jain
Differential Revision: D62408767
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135913
Approved by: https://github.com/gag1jain
Summary:
This logs all operations when the tracing log level is enabled for the `TCPStoreLibUvBackend`. This is very useful for debugging collective operations when issues occur, as it logs all hosts and the keys that they're modifying. To minimize total data, we only log the keys and not the values.
This changes the C10D_* macros to be much more efficient -- previously we would always format the log string even if it would never be printed, which is very wasteful for detailed tracing. This now gates them with an if statement to achieve the same behavior with no overhead.
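For intuition, a small Python analog of that gating (the actual change is in the C10D_* C++ macros, not this code):
```python
import logging

logger = logging.getLogger("c10d")

def trace(make_message):
    # only build the (potentially expensive) message when tracing is enabled
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug(make_message())

trace(lambda: f"wait key_count:{1} address:{'[localhost]:45646'}")
```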
Test Plan:
```
TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c "echo foo"
```
```
I0919 09:26:52.352013 34271 TCPStore.cpp:285] [c10d - debug] The server has started on port = 29500.
I0919 09:26:52.352246 34271 socket.cpp:783] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500).
I0919 09:26:52.352241 36903 TCPStoreLibUvBackend.cpp:1173] [c10d - debug] Uv main loop running
I0919 09:26:52.352308 34271 socket.cpp:854] [c10d - trace] The client socket is attempting to connect to [localhost]:29500.
I0919 09:26:52.353633 34271 socket.cpp:945] [c10d] The client socket has connected to [localhost]:29500 on SocketImpl(fd=41, addr=[localhost]:45646, remote=[localhost]:29500).
I0919 09:26:52.354422 34271 TCPStore.cpp:321] [c10d - debug] TCP client connected to host 127.0.0.1:29500
I0919 09:26:52.354558 36903 TCPStoreLibUvBackend.cpp:774] [c10d - trace] validate magic:1015412686 address:[localhost]:45646
I0919 09:26:52.354638 36903 TCPStoreLibUvBackend.cpp:789] [c10d - trace] ping nonce:34271 address:[localhost]:45646
I0919 09:26:52.356122 36903 TCPStoreLibUvBackend.cpp:866] [c10d - trace] add key:init/ val:1 address:[localhost]:45646
I0919 09:26:52.356308 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.356410 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:init/ address:[localhost]:45646
I0919 09:26:52.358688 36903 TCPStoreLibUvBackend.cpp:808] [c10d - trace] set key:/none/torchelastic/role_info/0 address:[localhost]:45646
I0919 09:26:52.360177 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.360296 36903 TCPStoreLibUvBackend.cpp:1004] [c10d - trace] multi_get key_count:1 address:[localhost]:45646
I0919 09:26:52.362076 36903 TCPStoreLibUvBackend.cpp:1036] [c10d - trace] multi_set key_count:1 address:[localhost]:45646
I0919 09:26:52.364001 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.364091 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:/none/torchelastic/assigned_ranks/0 address:[localhost]:45646
```
Differential Revision: D62924454
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136320
Approved by: https://github.com/c-p-i-o, https://github.com/XilunWu
Summary:
Add a third mode where we only print kernel names without dumping any intermediate actual tensor value info.
It can be helpful in quickly identifying the troublesome kernels in CUDA IMA issues.
Thanks to ColinPeppler and henrylhtsang for this "feature request".
Test Plan:
The output can look like this if you set `AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3`:
{F1871629091}
Differential Revision: D62791371
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136182
Approved by: https://github.com/henrylhtsang
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2454
This adds structured logging overhead at a per compile basis to compilation metrics.
To do so, we track the frame_id_frame_compile_id that trace_structured uses to categorize compiles, and use that as the key in our timing table.
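A simplified sketch of that timing table (names are illustrative, not the actual torch._logging internals):
```python
import time
from collections import defaultdict

# keyed by "frame_id_frame_compile_id", mirroring how compiles are categorized
structured_logging_overhead = defaultdict(float)

def timed_trace_structured(compile_id, emit_fn):
    start = time.perf_counter()
    emit_fn()  # the actual structured-log emission
    if compile_id is not None:  # emissions without a compile id are not attributed
        structured_logging_overhead[compile_id] += time.perf_counter() - start
```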
Implementation notes:
- If there are times we call trace_structured without a compile id, the time won't be measured. There isn't really a good way around that today, given the compile id framework of compilation metrics. Strobelight is still the best way to measure on a per-job basis.
- We don't actually measure the time it takes to log the compilation metrics itself. Fundamentally, it's not possible to log this properly if we're storing the logging number *in* compilation metrics, since there's no way to measure it before we do it(unless we want discrepancies between dynamo_compile and tlparse, which seems suboptimal). Hopefully for a large job, the cost of structured_logging compilation metrics itself is small.
- I wanted to use frame_phase_timing here, but there's a bunch of ids to iron out, and I don't really want to deal with that headache. compilation_time_metrics is sort of what I want, but that isn't by frame/compile id, so it's also a bit off. Putting it into torch.logging as a separate thing so logging tracks its own overhead seems fine, though.
Test Plan:
Run benchmarks/nanogpt and staging logger. See that the new compilation metric is logged to the staged dynamo_compile table:
https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/xazjg5xq
Note that the sum(structured_logging_overhead_s) / sum(entire_frame_compile_time) = 8.387 / 124.278 = 6%, which seems reasonable as the overhead for a small compilation like this.
You can also look at samples for a more detailed log of this.
Reviewed By: oulgen
Differential Revision: D62643611
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136142
Approved by: https://github.com/bobrenjc93
Summary:
To facilitate PSS-2 upgrade, this uses `ndt.NDArray` instead of `nd.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias to `nd.ndarray` -- a noop.
In Numpy-1.24, `ndt.NDArray` is a proper generic type, and without this change uses of `nd.ndarray` generate this Pyre type error:
```counterexample
Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters.
```
Test Plan: Sandcastle plus visual inspection
Differential Revision: D62977370
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136288
Approved by: https://github.com/kit1980
When tensor folding occurs during a matmul operation, the returned tensor is a view. This can cause issues when matmul is used inside a custom function and such a view is then returned as output: it cannot be modified in place afterwards, which causes errors.
It can be especially problematic when an in-place allreduce is performed after such a function.
The issue is resolved when unsafe_view is returned from matmul instead. This solution aligns the matmul decomposition with the eager implementation in that a non-view tensor is returned.
The test included in this PR reproduces the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134568
Approved by: https://github.com/zou3519
Fixes #127049
There's already a meta func in `meta_registrations.py` for the `add_` and `sub_` methods. I added a second meta function for error checking, i.e. `int.add/sub_(float)` and `bool.add/sub_(other types)`.
Also the corresponding test with Dynamo passes, removed `@xfailIfTorchDynamo`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135864
Approved by: https://github.com/williamwen42
Changes in this PR:
- Monkey-patching `F.scaled_dot_product_attention` with a lambda seems to not work in some cases. This PR avoids using a lambda.
- Running `fullgraph=True` and `fullgraph=False` in the same unit test seems to cause the two cases to interfere with each other and causes errors. This PR splits them into two separate unit tests.
- The checks in the unit tests might not work with compile cache. This PR turns off the cache in order to have a more predictable compile behavior to do unit test on.
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor_fullgraph_True`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor_fullgraph_False`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_True`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_False`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136219
Approved by: https://github.com/yifuwang
Summary:
Quite a few times, we see the NCCL PG abort taking too long. There's no easy way to measure this, so let's add a counter to measure this across the stack.
This will help us measure how much time the NCCL abort takes.
Test Plan:
Unit tests
Reviewed By: c-p-i-o
Differential Revision: D62675010
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136067
Approved by: https://github.com/fduwjj
skip_if_rocm is used only in the multiprocess case (when the UT test class is a child of MultiProcessTestCase), where each individual process can exit with a skip code. If used for a single-process UT, it will cause the UT to fail, as the process returns a non-zero exit code. Use skipIfRocm in single-process UTs.
To avoid the above confusion, this PR renames skip_if_rocm to skip_if_rocm_multiprocess.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136161
Approved by: https://github.com/jithunnair-amd, https://github.com/kwen2501, https://github.com/fegin
Continuation of https://github.com/pytorch/pytorch/pull/131909. This PR makes numpy tests compatible with numpy>=2.0.0. Specifically it deals with APIs that have been removed from numpy-2.0.
Changes in this PR:
1. Use `numpy.exceptions.ComplexWarning` if the `numpy.exceptions` namespace is present. In numpy-2.0, `numpy.ComplexWarning` has been removed in favor of `numpy.exceptions.ComplexWarning` (see the [numpy-2.0 migration guide](https://numpy.org/devdocs/numpy_2_0_migration_guide.html#changes-to-namespaces)). Note that `numpy.exceptions` was introduced in numpy-1.25.0 and hence does not exist in numpy<=1.24.x. (A small compat shim is sketched after this list.)
2. Do the same for `numpy.exceptions.VisibleDeprecationWarning`
3. Use `np.sort(...,axis=0)` over `np.msort()`(`np.msort()` removed in numpy-2.0)
4. Use `np.pad()` over `np.lib.pad()` (`np.lib` removed in numpy-2.0)
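A small compat shim reflecting items 1 and 2 (a sketch of the pattern, not the exact test code):
```python
import numpy as np

# numpy>=1.25 exposes the `numpy.exceptions` namespace; fall back otherwise
if hasattr(np, "exceptions"):
    ComplexWarning = np.exceptions.ComplexWarning
    VisibleDeprecationWarning = np.exceptions.VisibleDeprecationWarning
else:
    ComplexWarning = np.ComplexWarning
    VisibleDeprecationWarning = np.VisibleDeprecationWarning
```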
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136152
Approved by: https://github.com/atalman
Summary:
Remove the sleep from the `watchdogHandler` function. This sleep unnecessarily slows things down during a NCCL timeout.
Flight recorder is configured to take a minute, at most, to dump out its buffer.
This sleep ends up waiting for `8` minutes before destroy is called.
Test Plan: Unit tests.
Differential Revision: D62529875
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135760
Approved by: https://github.com/fduwjj, https://github.com/shuqiangzhang
Summary: Currently we process events in the regular allocation path: we call cudaEventQuery to check on the events, and this path can take some locks in the libcuda driver. It's not strictly necessary to process events in the allocation path; we could move this to a background thread, keep processing events regularly, and put the freed blocks on the free list.
Differential Revision: D62396585
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135524
Approved by: https://github.com/zyan0
Summary:
This diff adds an option to round the non-split blocks in the caching allocator so that they can be reused without causing lots of fragmentation for large memory segments.
For example, if we specify the max_split memory size as 400MB, then all allocations of more than 400MB will not be split. Let's say we allocated some 1024MB blocks and these are cached in the allocator. If we request a new 500MB block, we round it to the nearest power-of-two division, that is 512MB, and add the default kLargeBuffer of 20MB, giving 532MB; since 532MB is less than the existing 1024MB block, the 1024MB block will not be used for this allocation, and a new 512MB block will be created instead. This diff exposes the rounding padding as a configurable option (max_non_split_rounding_size): if 512MB + max_non_split_rounding_size is greater than 1024MB, we will reuse the 1024MB block and won't create a new 512MB block via cudaMalloc. This option is added so that we can pre-allocate some large blocks, reuse them as much as possible, and avoid stalling on cudaMalloc.
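A simplified sketch of the reuse rule described above (sizes in MB; the real allocator works on segments and power-of-two divisions, so this is only an approximation):
```python
def pick_block(request_mb, max_non_split_rounding_mb, cached_block_sizes_mb):
    rounded = 1 << (request_mb - 1).bit_length()    # e.g. 500 -> 512
    limit = rounded + max_non_split_rounding_mb     # how far we allow rounding up
    for size in sorted(cached_block_sizes_mb):
        if rounded <= size <= limit:
            return size      # reuse a cached block instead of calling cudaMalloc
    return rounded           # otherwise allocate a fresh block of the rounded size

print(pick_block(500, 20, [1024]))   # 512: 532 < 1024, so no reuse (default padding)
print(pick_block(500, 600, [1024]))  # 1024: larger padding allows reusing the block
```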
Differential Revision: D62758758
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136174
Approved by: https://github.com/zyan0
Summary:
# context
* for the root cause and background please refer to this [post](https://fb.workplace.com/groups/1028545332188949/permalink/1042204770823005/)
* the basic idea of this diff is to **short circuit the pytree flatten-unflatten function pairs** between two preserved modules, i.e., EBC/fpEBC and KTRegroupAsDict.
NOTE: There could be multiple EBCs and one single KTRegroupAsDict as shown in the [pic](https://fburl.com/gslide/lcyt8eh3) {F1864810545}
* short-circuiting the EBC-KTRegroupAsDict pairs is very special and a must in most cases, due to the EBC key-order issue with distributed table lookup.
* hide all the operations behind a control flag `short_circuit_pytree_ebc_regroup` to the torchrec main api call `decapsulate_ir_modules`, which should only be visible to the infra layer, not to the users.
# details
* The `_short_circuit_pytree_ebc_regroup` function finds all the EBCs/fpEBC and KTRegroupAsDict modules in an unflattened module. Retrieve their fqns and sort to in_fqns (regroup_fqns) and out_fqns (ebc_fqns). Because currently the fpEBC is swapped as a whole, so we do some extra fqn logic to filter out the EBC that belongs to an up-level fpEBC.
* a util function `prune_pytree_flatten_unflatten` removes the in-coming and out-going pytree flatten/unflatten function calls in the graph module, based on the given fqns.
WARNING: The flag `short_circuit_pytree_ebc_regroup` should be turned on if EBCs are used and EBC sharding is needed. Assertions are also added if can't find a `KTRegroupAsDict` module, or `finalize_interpreter_modules` is not `True`.
# additional changes
* absorb the `finalize_interpreter_modules` process inside the torchrec main api `decapsulate_ir_modules`.
* set `graph.owning_module` in export.unflatten as required by the graph modification
* add one more layer of `sparse_module` for closely mimicking the APF model structure.
Test Plan:
# run test
* serializer
```
buck2 run fbcode//mode/opt fbcode//torchrec/ir/tests:test_serializer
```
* apf
```
buck2 run fbcode//mode/opt fbcode//aps_models/ads/gmp/tests/ne/e2e_deterministic_tests:gmp_e2e_ne_tests -- --filter-text 'test_mtml_instagram_model_562438350_single_gpu_with_ir'
```
* local mp run
```
==== Finished E2E deterministic test for mtml_instagram_model_gmp_474023725_non_kjt_unary ====
finished
test_mtml_instagram_model_562438350_single_gpu_with_ir
Imports took: 6.0s! Profile with --import-profiler. --_ |""---__
Executed 1 example in 203.1s: |'.| || . """|
Successful: 1 | || || /|\""-. |
Failed: 0 | || || | | |
Skipped: 0 | || || | \|/ |
Not executed: 8 |."| || --"" '__|
https://testslide.readthedocs.io/ --" |__---"""
```
Differential Revision: D62606738
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136045
Approved by: https://github.com/angelayi
Currently when we deserialize inputs to nodes, we deserialize arguments with default values as kwargs. So deserializing `aten.uniform`, which has the signature `uniform(Tensor(a!) self, float from=0, float to=1, *, Generator? generator=None) -> Tensor(a!)`, will become `uniform(x, from=0, to=1)`. However, this fails when running in Python because `from` is a Python keyword. The solution here is to not deserialize it as a kwarg.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136036
Approved by: https://github.com/zhxchen17
`rms_norm()` is a nice-to-have for ViT :)
This PR:
* SymInt-ifies `rms_norm()`, allowing NJT to use the same decomp.
* Adds torch_function-based input validation logic for nested-specific stuff (no normalization supported over the ragged dim for now) on the python NJT side.
* Adds multi-dim support (on non-ragged, non-batch dims) to `mean()` for NJT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135872
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #125947
The previous implementation of the `numpy()` method returned `fp64` when the tensor was `fp32`. This is unexpected, but seems to be caused by calling `__array__(dtype=None)` on the numpy array. I updated the implementation to define the `numpy()` method explicitly and added tests to guard the behavior.
This needs to be cherry-picked into torch 2.5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136162
Approved by: https://github.com/gramalingam, https://github.com/xadupre
When stub files (`*.pyi`) were removed from `optim` (#125556, #125452), some types that existed are no longer available. This pull request adds them back.
Just for reference, these types are used in `pytorch-lightning`'s `LightningCLI`. Command line interfaces are created automatically, and having type hints makes them nicer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136185
Approved by: https://github.com/janeyx99
## Motivation
The device reported by tensor.device, both for sharded and non-sharded tensors, is set to cuda by default. Hence, while running the FSDP UTs we see the following errors. This change derives the actual device type from the created tensor.
```
[rank3] File "/root/repos/pytorch-training-tests/tests/pytorch/v2.4.0/distributed_hpu/fsdp/test_fsdp_dtensor_state_dict.py", line 143, in test_dtensor_sharded_tensor_state_dict_identical
[rank3] sharded_tensor_sd = ref_model.state_dict()
[rank3] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1944, in state_dict
[rank3] hook_result = hook(self, destination, prefix, local_metadata)
[rank3] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank3] return func(*args, **kwargs)
[rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_state_dict_utils.py", line 752, in _post_state_dict_hook
[rank3] tensor.device,
[rank3] File "/usr/local/lib/python3.10/dist-packages/typing_extensions.py", line 2853, in wrapper
[rank3] return arg(*args, **kwargs)
[rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1152, in __torch_function__
[rank3] return dispatch(st_instance, func)
[rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1134, in dispatch
[rank3] return _SHARDED_OPS[func](types, args, kwargs, st._process_group)
[rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/op_registry_utils.py", line 33, in wrapper
[rank3] return wrapped_func(types, args, kwargs, process_group)
[rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/_ops/tensor_ops.py", line 52, in tensor_device
[rank3] dev = torch.device(torch.cuda.current_device())
[rank3] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 878, in current_device
[rank3] _lazy_init()
[rank3] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 305, in _lazy_init
[rank3] raise AssertionError("Torch not compiled with CUDA enabled")
[rank3] AssertionError: Torch not compiled with CUDA enabled
````
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134994
Approved by: https://github.com/fegin
Fixes https://github.com/pytorch/pytorch/issues/136064
In the linked repro, this issue was that there was some code like this:
```
# x has dtype torch.float32
def f(x):
y = x.view(torch.float32)
y.copy_(...)
```
Because `view.dtype` is implemented today to potentially return its input directly, we would end up clobbering the proxy for our graph input (replacing its FX proxy value from `arg0_1` to `view_1`). This is not desirable: we have careful assertions in AOTDispatcher that mutations only ever happen on graph inputs, but this clobbering made the mutation appear, from the perspective of the FX graph, to be happening on a view of the input.
Why is this normally not a problem? Ordinarily, the `ADInplaceOrView` kernel for `view.dtype` will take the output of the view kernel, [and detach() it](https://github.com/pytorch/pytorch/blob/main/tools/autograd/gen_inplace_or_view_type.py#L466) (properly creating a fresh `TensorImpl`).
This does **not** happen, though, if you are executing the kernel from with a `__torch_dispatch__` region: the `ADInplaceOrView` logic has already run above you, so that key will be in the TLS exclude set.
This PR changes eager behavior - at first I considered trying to only change behavior under compile. But this problem isn't technically specific to PT2: if you ever rely on tensor identity from inside of a __torch_dispatch__ call, then we need to make sure the raw `view.dtype` kernel doesn't directly return the input.
I am also making the assumption that "`view.dtype` no-op'ing when the dtype is the same" is not a case worth optimizing in eager mode, and that the overhead of the `TensorImpl` creation is relatively negligible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136074
Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/albanD
ghstack dependencies: #136041
As in the title.
Tackles https://github.com/pytorch/ao/pull/821/files#r1759821413
The PR assumes that the existing tuning parameters are good also when using scaling arguments. This needs to be verified as a follow-up task.
Also, this PR redefines triton-contiguous tensors: the tensor must have strides not larger than 1. This will now allow zero strides that previously triggered `contiguous` call although the underlying memory buffer was contiguous.
Re: "a considerable slow-down occurs because tensor data is copied element-wise rather than chunk-wise" - this note should refer to a code (torch or triton?) that implements the element/chunk-wise copy so that we could verify that allowing zero strides indeed would not trigger element-wise copies. Atm, the performance increase in ViT-H benchmarks (that involve using 0 strides) is an evidence that allowing zero strides does not lead to slow-downs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136104
Approved by: https://github.com/cpuhrsch
We introduced the dispatchable backend for a ProcessGroup and collectives in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup of the options of a ProcessGroup: users either set the timeout or backend later on, or directly create a backend after creating a PG.
Also, PGNCCL is using the Options class from ProcessGroup, but we actually should use Options from the Backend class. So this PR aligns the type and name with what we are doing on the cpp side. I don't change the signature of the public API, so it still uses args named "pg_options".
We need to update the tests to align with this change.
This is an attempt to reland D62008954 by fixing internal errors.
Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135653
Approved by: https://github.com/wz337, https://github.com/H-Huang
Summary:
We refactor FxGraphCache.load into three phases:
- prepare_key, which checks that an inductor input is cacheable and bypasses otherwise
- load_with_key, which tries to lookup the key in the cache
- post compile, where we do some logging and run post compile steps
Splitting it along these lines will allow AOTAutogradCache to use load_with_key and still get access to all of the observability + remote cache logic when accessing FxGraphCache, without needing to pass key components, etc.
Differential Revision: D62314862
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135491
Approved by: https://github.com/oulgen
Fixes #136090
* Add support for isin to tensor half dtypes for CPU (just add a few extra dispatches).
* Seems like the CUDA implementation for bfloat16 was mostly compiled and available all along (it just calls sort and unique internally). To enable it, we just need to remove an assert (sort's functionality was updated since the assert was added) and add the missing dtype support to unique.
* This unlocks more GPU functionality with minimal code bloat. I also added CPU kernels for the dtypes for parity.
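A small usage sketch (hedged; the exact dtype coverage follows the bullets above):
```python
import torch

# float16 on CPU gains dispatches here; bfloat16 similarly gains CUDA support.
elements = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float16)
test_values = torch.tensor([2.0, 4.0], dtype=torch.float16)
print(torch.isin(elements, test_values))  # tensor([False,  True, False])
```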
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136114
Approved by: https://github.com/malfet
By default, Inductor promotes arguments to the common highest dtype.
Having an empty token with dtype=torch.float32 resulted in dtype promotion for effectful ops during lowering of with_effects.
This PR disables dtype promotion for this lowering and removes the previous workaround of making the token dtype torch.bool.
Testing:
```
python test/distributed/test_c10d_functional_native.py -k test_inductor_dtypeview_memory_lea
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136039
Approved by: https://github.com/bdhirsh, https://github.com/eellison, https://github.com/zou3519
Summary: This implements a default backend proxy that tries to look up a backend via dlsym. What this enables is dynamically loading a module with a backend implementation without having it statically linked with the application.
Differential Revision: D62549295
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135967
Approved by: https://github.com/c-p-i-o
In this PR, we deprecate the _preserve_ops feature in the run_decompositions API. We can't kill this argument completely because the Executorch team depends on it. As syncing between the two repos is non-trivial, I just leave this argument as deprecated for now. In the next PR, I will remove it.
After this PR, run_decompositions will only decompose what's inside the decomp table and preserve the rest by default. Note that this feature is only rolled out to OSS for now. Old code path is protected under IS_FBCODE flag.
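A hedged sketch of the new default (module and table contents are my own; behavior assumed from the summary above): only ops present in the decomp table are decomposed, so an empty table preserves everything.
```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.linear(x)

ep = torch.export.export(M(), (torch.randn(2, 4),))
# Empty decomp table: nothing is decomposed, ops are preserved as-is.
ep = ep.run_decompositions(decomp_table={})
print(ep.graph_module.code)
```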
Differential Revision: [D62163161](https://our.internmc.facebook.com/intern/diff/D62163161/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135080
Approved by: https://github.com/justinchuby, https://github.com/avikchaudhuri, https://github.com/bdhirsh
> Ignore FSDP2 forward hook side-effects in AC
Under AC, FSDP2 does not rely on forward hook to all-gather weights to do recomputation, instead it relies on pre-backward hook to do this job:
451eaf0ff2/torch/distributed/_composable/fsdp/_fsdp_state.py (L219-L220)
So when we use `speculate_subgraph` to trace the utils.checkpoint AC region, we don't actually need to worry about the FSDP2 forward hook's side effects and can safely ignore them, because we do not (and do not expect to) re-run the FSDP2 forward hook during backward recomputation.
----
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134997
Approved by: https://github.com/zou3519
ghstack dependencies: #135727
Running Torchbench llama with dynamic size failed with
```
File "/localdisk/leslie/torch_inductor_community/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4182, in produce_guards
raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs'][0].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
- Not all values of RelaxedUnspecConstraint(L['inputs'][0].size()[0]) are valid because L['inputs'][0].size()[0] was inferred to be a constant (32).
```
Skip marking the dynamic dim for this model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135960
Approved by: https://github.com/ezyang
This PR changes how `reconstruct` is done for a ConstDict. As of today, it works as follows:
(1) codegen(...) each pair of key/value
(2) create a new dictionary to hold the new items
(3) clear the original dictionary
(4) update the original dict with the one created in (2)
We do a micro optimization in the generated bytecode to:
- Only codegen the items that changed.
- Only clear the original dictionary if a key was removed.
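As a hedged illustration (a hypothetical snippet, not from the PR) of the kind of pattern this affects: when a traced region mutates a single key of a pre-existing dict, only that key needs to be re-emitted when the dict is reconstructed at a graph break.
```python
import torch

d = {"a": torch.zeros(2), "b": torch.zeros(2)}

@torch.compile(backend="eager")
def f(x):
    d["a"] = d["a"] + x          # only "a" changed; "b" is untouched
    torch._dynamo.graph_break()  # forces dynamo to reconstruct `d` at the break
    return d["a"] + d["b"]

f(torch.ones(2))
```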
Fixes: #133487
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134876
Approved by: https://github.com/zou3519
Split out and modified from https://github.com/pytorch/pytorch/pull/130228. There were a bunch of subtle bugs, e.g. sometimes we need to use torch.ops.aten.{operator}.Tensor while other times we need torch.ops.aten.{operator}.default, or in the case of pow we need Tensor_Tensor. I figured it'd be easier to split out adding TensorReferenceAnalysis, add some tests, and do the actual integration in a separate diff.
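For illustration, the overload-name distinction referred to above (these particular overloads exist in ATen; the choice of examples is mine):
```python
import torch

# The same operator packet exposes differently named overloads; reference
# analysis has to pick the right one per operator.
print(torch.ops.aten.add.Tensor)         # binary add on two tensors
print(torch.ops.aten.relu.default)       # ops with a single overload use .default
print(torch.ops.aten.pow.Tensor_Tensor)  # pow needs the Tensor_Tensor overload
```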
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135886
Approved by: https://github.com/ezyang
Updates the pybind11 submodule. The major patch note is an experimental new function, cpp_conduit, added to all pybind11 objects, which makes them more compatible across pybind11 versions, settings, and frameworks (such as nanobind). No code changes are needed on our end except to update the submodule.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136087
Approved by: https://github.com/malfet
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)
Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call
resume fn structure:
1. enter context
2. jump
...
3. exit context
The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).
So for torch function modes the structure of our output code is this:
1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function
Then our resume fn looks like this:
1. no-op enter torch function mode
2. jump
3. exit tf mode
To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).
Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However, once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects, because we want to preserve the state of the mode stack as is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode.
All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
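A minimal sketch of the pattern this PR enables (the mode and function below are my own illustration, assuming the default enter/exit behavior described above):
```python
import torch
from torch.overrides import TorchFunctionMode

class NoopMode(TorchFunctionMode):
    # Default enter/exit behavior: __enter__ pushes the mode, __exit__ pops it.
    def __torch_function__(self, func, types, args=(), kwargs=None):
        return func(*args, **(kwargs or {}))

@torch.compile(backend="eager")
def f(x):
    with NoopMode():
        y = x + 1
        torch._dynamo.graph_break()  # graph break inside the mode's with-block
        return y * 2

print(f(torch.ones(3)))
```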
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66))
This PR adds support for calling the setattr of thread-local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, because calling `object.__setattr__` on these objects results in a TypeError.
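For illustration only (using a plain `threading.local`; the objects the PR handles may differ), this is the kind of TypeError the generic setattr path runs into:
```python
import threading

tl = threading.local()
tl.x = 1  # the type's own setattr works fine
try:
    object.__setattr__(tl, "y", 2)  # generic setattr is rejected by this C type
except TypeError as exc:
    print("TypeError:", exc)
```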
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
For tracing cond/while in eager, we trace the HOP with the eager backend with the metadata torch function mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it cannot be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
All of the previous benchmarks are similar; ListOfLinears should be representative enough.
I copied the previous benchmarks from unit tests without much intention; I was just trying to create a large number of benchmarks to better observe noise.
This PR keeps only one; we can add more as we see value and regressions in the future.
This diff also adds a GPU version.
```
collecting compile time instruction count for basic_modules_ListOfLinears_eager
compile time instruction count for iteration 0 is 6479525851
compile time instruction count for iteration 1 is 1024432680
compile time instruction count for iteration 2 is 1019417317
compile time instruction count for iteration 3 is 1013603566
compile time instruction count for iteration 4 is 1008853980
compile time instruction count for iteration 5 is 1009541481
compile time instruction count for iteration 6 is 1005025533
compile time instruction count for iteration 7 is 1004116323
compile time instruction count for iteration 8 is 1000828633
compile time instruction count for iteration 9 is 999788323
collecting compile time instruction count for basic_modules_ListOfLinears_inductor
compile time instruction count for iteration 0 is 40837529730
compile time instruction count for iteration 1 is 18411921909
compile time instruction count for iteration 2 is 18383665161
compile time instruction count for iteration 3 is 18348983522
compile time instruction count for iteration 4 is 18349276590
compile time instruction count for iteration 5 is 18353046274
compile time instruction count for iteration 6 is 18346818581
compile time instruction count for iteration 7 is 18340057998
compile time instruction count for iteration 8 is 18331267320
compile time instruction count for iteration 9 is 18328381338
collecting compile time instruction count for basic_modules_ListOfLinears_inductor_gpu
compile time instruction count for iteration 0 is 15408870979
compile time instruction count for iteration 1 is 10949520859
compile time instruction count for iteration 2 is 11058786167
compile time instruction count for iteration 3 is 11003606719
compile time instruction count for iteration 4 is 10896406770
compile time instruction count for iteration 5 is 10982875189
compile time instruction count for iteration 6 is 10931848275
compile time instruction count for iteration 7 is 10956345008
compile time instruction count for iteration 8 is 11045384499
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135730
Approved by: https://github.com/ezyang, https://github.com/anijain2305
Summary:
Move towards consolidating strobelight profiler implementations between OSS and fbcode. This change is a first step towards that.
- Created a new function to abstract out compile-time profiling enablement. This function allows the profiler to switch between different function profilers (e.g. Thrift-based or CLI-based)
- Both OSS and fbcode now use one compile-time profiler in torch/_strobelight
Test Plan:
Tested OSS with following commands:
```
python torch/_strobelight/examples/compile_time_profile_example.py
python torch/_strobelight/examples/cli_function_profiler_example.py
TORCH_COMPILE_STROBELIGHT=TRUE TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 python benchmarks/dynamo/huggingface.py --ci --accuracy --timing --explain --inductor --device cuda --training --amp --only XLNetLMHeadModel
```
See test commands for fbcode in comments.
Differential Revision: D62444551
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135953
Approved by: https://github.com/laithsakka
If node is AC region output and has a backward hook on it, we intentionally choose to save it.
This is to work around circular dependencies in Traceable FSDP2+AC.
Example:
```
out = fully_shard(utils.checkpoint(module))(x)
norm_out = layer_norm(out)
```
and there is a circular dependency:
1. In backward, the grad_input of layer_norm (i.e. `out_grad`) actually depends on `out`.
2. To be recomputed, `out` depends on `out`'s backward hook created by FSDP2 (which does the all-gather for `module` weights).
3. `out`'s FSDP2 backward hook, as is the case for all eager backward hooks, depends on `out_grad` -> circular dependency with (1)!
Solution: check whether `out` has a backward hook, and if so, intentionally save `out` in forward graph outputs. With this, we can break the above circular dependency.
----
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135727
Approved by: https://github.com/Chillee
During enablement of Traceable FSDP2 on internal models, sometimes the user only applies torch.compile to some of the FSDP2 instances but not all of them. Such a mixed usage pattern is not supported by compiled autograd. Here we try to catch such usage and throw an error, so that the user can fix the usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135824
Approved by: https://github.com/awgu
When we measure compile time instruction count, in most cases we probably do not want to measure GC instructions, so we disable GC here by default.
If it is needed, we can add an option to allow it, or one can use the regular total instruction count instead of the compile time instruction count.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135768
Approved by: https://github.com/ezyang, https://github.com/anijain2305
https://github.com/pytorch/pytorch/pull/133012 caused a regression on ROCm that makes the pointwise scan tests fail:
```
ERROR: test_pointwise_associative_scan_tuple_reverse_True_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_tuple_reverse_False_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_complex_pytree_reverse_True_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_complex_pytree_reverse_False_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_binary_operator_reverse_True_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_binary_operator_reverse_False_combine_mode_pointwise_cuda
```
Skipping temporarily while triage is underway.
Full log: https://ossci-raw-job-status.s3.amazonaws.com/log/30067645445
```
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/graph.py", line 1020, in call_function
out = lowerings[target](*args, **kwargs) # type: ignore[index]
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/lowering.py", line 363, in wrapped
out = decomp_fn(*args, **kwargs)
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/lowering.py", line 6245, in associative_scan
raise RuntimeError("Unable to generate code for associative_scan op")
torch._inductor.exc.LoweringException: RuntimeError: Unable to generate code for associative_scan op
```
NOTE: even "eager" backend fails
```
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_higher_order_ops/associative_scan.py", line 338, in associative_scan_op_dense
raise NotImplementedError("associative_scan is not implemented for eager")
NotImplementedError: associative_scan is not implemented for eager
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135995
Approved by: https://github.com/malfet
This PR solves two problems with `sum()` support in NJT:
* `sum()` over a dim with `keepdim=True` returns the wrong shape (i.e. it'll keep the wrong dim). This is a long-standing bug from way back in #112519.
* Historically, we've only supported `sum()` over a dim and not a full reduction. This PR adds the full reduction form (forward only, backward still fails).
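A hedged usage sketch (jagged-layout NJT; the shapes and printed values are illustrative):
```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 4), torch.randn(5, 4)], layout=torch.jagged
)
print(nt.sum())                            # full reduction (newly added, forward only)
print(nt.sum(dim=-1, keepdim=True).shape)  # keepdim now keeps the correct (last) dim
```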
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131945
Approved by: https://github.com/davidberard98, https://github.com/jananisriram
Summary:
Previously we only checked dtype and is_dynamic to decide if two quantization specs are equivalent.
This may not work in some cases, e.g. when people use a different qscheme or quant_min/quant_max.
This PR adds checks for the other fields as well.
Test Plan:
regression tests
Differential Revision: [D62530974](https://our.internmc.facebook.com/intern/diff/D62530974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135736
Approved by: https://github.com/sxu
There was a recent strange noise of +5%, -5%.
Using only compile time:
1) avoids GC time.
2) avoids other operations that are not what we are trying to measure. ==> less probable noise.
```
collecting compile time instruction count for sum_floordiv_regression
compile time instruction count for iteration 0 is 8899290248
compile time instruction count for iteration 1 is 1188830489
compile time instruction count for iteration 2 is 1180579615
compile time instruction count for iteration 3 is 1176263131
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135785
Approved by: https://github.com/avikchaudhuri, https://github.com/anijain2305
I am thinking maybe 3 iterations are enough for this one?
- I am keeping eager and inductor since inductor is 2X the eager time.
- Eager dynamic is 2X eager, so I am keeping this as well.
- Inductor has three tests (dynamic gpu, gpu and cpu).
I am unsure if I am over-profiling here; happy to trim if anyone has suggestions.
```
collecting compile time instruction count for add_loop_eager
compile time instruction count for iteration 0 is 8213664211
compile time instruction count for iteration 1 is 2798628246
compile time instruction count for iteration 2 is 2796811362
compile time instruction count for iteration 3 is 2794438188
compile time instruction count for iteration 4 is 2794634117
collecting compile time instruction count for add_loop_eager_dynamic
compile time instruction count for iteration 0 is 5724108021
compile time instruction count for iteration 1 is 5499908609
compile time instruction count for iteration 2 is 5569101366
compile time instruction count for iteration 3 is 5493806364
compile time instruction count for iteration 4 is 5493169851
collecting compile time instruction count for add_loop_inductor
compile time instruction count for iteration 0 is 49789381222
compile time instruction count for iteration 1 is 25769347393
compile time instruction count for iteration 2 is 25772594322
compile time instruction count for iteration 3 is 25768695952
compile time instruction count for iteration 4 is 25768032314
collecting compile time instruction count for add_loop_inductor_gpu
compile time instruction count for iteration 0 is 23966942581
compile time instruction count for iteration 1 is 23771950919
compile time instruction count for iteration 2 is 23770784286
compile time instruction count for iteration 3 is 23780160875
compile time instruction count for iteration 4 is 23774634465
collecting compile time instruction count for add_loop_inductor_dynamic_gpu
compile time instruction count for iteration 0 is 41505055086
compile time instruction count for iteration 1 is 41293654089
compile time instruction count for iteration 2 is 41301016100
compile time instruction count for iteration 3 is 41306056207
compile time instruction count for iteration 4 is 41308171566
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135809
Approved by: https://github.com/ezyang, https://github.com/anijain2305
Summary:
Fixes https://github.com/pytorch/pytorch/issues/134778
The previous D62304294 broke some executorch tests. It has already been reverted.
In this diff, `_collect_param_buffer_metadata()` is modified so that when a `call_function` node is encountered and its input nodes include `get_attr`, we skip the fields that have been collected previously and only collect the rest of the fields. This prevents over-writing.
Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//executorch/backends/xnnpack/test:test_xnnpack_ops
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_re_export_preserve_handle
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_run_decompositions_preserve_handle
```
Differential Revision: D62514208
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135720
Approved by: https://github.com/zhxchen17, https://github.com/jerryzh168
Fixes #134564
Root cause:
The `lintrunner` wheel released on [pypi.org](https://pypi.org/project/lintrunner/#files) only supports Windows 32bit and Linux 64bit. Since compilation of pytorch requires a 64bit env, on Windows `lintrunner` has to be compiled from the source distribution. `Rust` is its dependency for compilation, as indicated in the error message. Meanwhile, a Visual Studio environment is needed for linking libraries.

Issue when performing `pip install lintrunner` without a Visual Studio environment activated is shown below.
```bash
>python -m pip install lintrunner
Collecting lintrunner
Downloading lintrunner-0.12.5.tar.gz (62 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: lintrunner
Building wheel for lintrunner (pyproject.toml) ... error
error: subprocess-exited-with-error
× Building wheel for lintrunner (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [137 lines of output]
Running `maturin pep517 build-wheel -i C:\Users\\miniforge3\envs\py310\python.exe --compatibility off`
📡 Using build options bindings from pyproject.toml
Compiling proc-macro2 v1.0.79
Compiling unicode-ident v1.0.12
Compiling version_check v0.9.4
Compiling windows_x86_64_msvc v0.52.4
Compiling winapi v0.3.9
Compiling serde v1.0.197
Compiling autocfg v1.2.0
Compiling syn v1.0.109
Compiling lazy_static v1.4.0
Compiling libc v0.2.153
Compiling equivalent v1.0.1
Compiling hashbrown v0.14.3
Compiling memchr v2.7.2
Compiling yansi v1.0.1
Compiling unicode-width v0.1.11
Compiling regex-syntax v0.8.3
Compiling encode_unicode v0.3.6
Compiling cfg-if v1.0.0
Compiling winnow v0.6.5
Compiling cc v1.0.92
error: could not compile `windows_x86_64_msvc` (build script) due to 2 previous errors
warning: build failed, waiting for other jobs to finish...
error: could not compile `serde` (build script) due to 2 previous errors
error: could not compile `proc-macro2` (build script) due to 2 previous errors
error: could not compile `syn` (build script) due to 2 previous errors
error: could not compile `libc` (build script) due to 2 previous errors
error: could not compile `winapi` (build script) due to 2 previous errors
💥 maturin failed
Caused by: Failed to build a native library through cargo
Caused by: Cargo build finished with "exit code: 101": `cargo rustc --manifest-path Cargo.toml --message-format json --release --bins --`
📦 Including license file "LICENSE"
🔗 Found bin bindings
error: linker `link.exe` not found
|
= note: program not found
note: the msvc targets depend on the msvc linker but `link.exe` was not found
note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.
note: VS Code is a different product, and is not sufficient.
error: aborting due to 1 previous error
error: linker `link.exe` not found
|
= note: program not found
note: the msvc targets depend on the msvc linker but `link.exe` was not found
note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.
note: VS Code is a different product, and is not sufficient.
error: aborting due to 1 previous error
error: linker `link.exe` not found
|
= note: program not found
note: the msvc targets depend on the msvc linker but `link.exe` was not found
note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.
note: VS Code is a different product, and is not sufficient.
error: aborting due to 1 previous error
error: linker `link.exe` not found
|
= note: program not found
note: the msvc targets depend on the msvc linker but `link.exe` was not found
note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.
note: VS Code is a different product, and is not sufficient.
error: aborting due to 1 previous error
error: linker `link.exe` not found
|
= note: program not found
note: the msvc targets depend on the msvc linker but `link.exe` was not found
note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.
note: VS Code is a different product, and is not sufficient.
error: aborting due to 1 previous error
error: linker `link.exe` not found
|
= note: program not found
note: the msvc targets depend on the msvc linker but `link.exe` was not found
note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.
note: VS Code is a different product, and is not sufficient.
error: aborting due to 1 previous error
Error: command ['maturin', 'pep517', 'build-wheel', '-i', 'C:\\Users\\\\miniforge3\\envs\\py310\\python.exe', '--compatibility', 'off'] returned non-zero exit status 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for lintrunner
Failed to build lintrunner
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (lintrunner)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134567
Approved by: https://github.com/malfet
Summary:
As title. Follow-up to add a stats summary (mean/min/max, etc.) for JIT Inductor tensor value printing as well.
The inductor python wrapper code level printing would look something like this:
{F1859224287}
Test Plan: CI
Reviewed By: chenyang78
Differential Revision: D62415575
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135887
Approved by: https://github.com/chenyang78
We previously only supported the same v_head dim and qk_head dim. When we allowed different head dims, I accidentally kept the same query strides for the output. This PR fixes this bug and also ensures that we always produce output in the same stride order as the input query.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135882
Approved by: https://github.com/yanboliang, https://github.com/Chillee
Summary:
Record remote cache time saved via frame_phase_timing
We add to the "phase" when remote cache hits and saves us time, so that we have a 1:1 correspondence between a frame and time saved.
Test Plan:
Internally run benchmark, see that it's populated in sandbox table after previous diff lands and logger config is actualized.
Show that column exists in table:
https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/fp2te0ff
Note that an earlier version of D62105258 had the column as a string so the staging table is a bit messed up. But you can see the most recent samples have the column populated as a float.
Reviewed By: aorenste
Differential Revision: D62106921
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135490
Approved by: https://github.com/aorenste
Summary:
Since https://www.internalfb.com/diff/D62215095 landed, there have been many silent errors due to the dependency between functional_tensor and config.
```
File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/__init__.py", line 64, in <module>
File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/dynamic_shapes.py", line 23, in <module>
File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/exported_program.py", line 26, in <module>
File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_higher_order_ops/__init__.py", line 1, in <module>
File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_higher_order_ops/cond.py", line 6, in <module>
File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_subclasses/functional_tensor.py", line 9, in <module>
File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_inductor/config.py", line 44, in <module>
```
https://fburl.com/logarithm/ol5kx0ee
complaining about a cyclic dependency.
This fixes it.
Test Plan: buck test multipy/runtime:test_deploy_embedded_cuda_interp_without_cuda_available -- --run-disabled TorchpyTest.AcquireMultipleSessionsInDifferentPackages
Reviewed By: aorenste
Differential Revision: D62616765
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135926
Approved by: https://github.com/aorenste, https://github.com/oulgen, https://github.com/Skylion007
Fix https://github.com/pytorch/pytorch/issues/134095
This is a workaround for loading full state dict into a FSDP1+TP 2D model.
Since named_parameters() in FSDP1 does not return DTensor, we don't have the information to shard the full_state_dict and load it directly into the 2D model. In order to load a full state dict into an FSDP1+TP 2D model, we need to:
- load the full state dict into a 1D FSDP model
- dcp.save the full/shard state dict into storage
- initialize a 2D FSDP1+TP model
- get the default sharded state dict for the 2D model (full_state_dict=False)
- dcp.load the state dict from storage
- load the state dict into the 2D model
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135763
Approved by: https://github.com/fegin
ghstack dependencies: #135725
Fix https://github.com/pytorch/pytorch/issues/134095
This fixes the distributed state dict full_state_dict option hanging during set_state_dict. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support the FSDP2+TP 2D strided sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective). This means it's the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725
Approved by: https://github.com/fegin
Summary: Fixed a bunch of fbcode imports that happened to work but confused autodeps. After this autodeps still suggests "improvements" to TARGETS (which breaks our builds) but at least it can find all the imports.
Test Plan:
```
fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/TARGETS fbcode/caffe2/test/TARGETS
```
Before:
```
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/testing.py:229) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fbur$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export.py:87) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_serdes.py:9) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fb$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_serdes.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_retraceability.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https:$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_retraceability.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See ht$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_nonstrict.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See http$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_nonstrict.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:8) when processing rule "test_export". Please make sure it's listed in the srcs parameter of an$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Found "//python/typeshed_internal:typeshed_internal_library" owner for "cv2" but it is protected by visibility rules: [] (from caffe2/test/test_bundled_images.py:7) when processing rule "test_bundled_$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "caffe2.test.profiler_test_cpp_thread_lib" (from caffe2/test/profiler/test_cpp_thread.py:29) when processing rule "profiler_test_cpp_thread". Please make sure it's listed in t$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_custom_ops.py:23) when processing rule "custom_ops". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_public_bindings.py:13) when processing rule "public_bindings". Please make sure it's listed in the srcs paramete$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.symbolize_tracebacks" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.gather_traceback" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another rule$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for include <torch/csrc/autograd/profiler_kineto.h> (from caffe2/test/profiler/test_cpp_thread.cpp:2) when processing profiler_test_cpp_thread_lib. Some things to try:
```
Differential Revision: D62049222
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135614
Approved by: https://github.com/oulgen, https://github.com/laithsakka
Fixes #131337
- add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`.
- do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead.
- add workspace allocation generation code to `kernel_autotune_calls`. e.g.
```python
workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8)
workspace.zero_()
.....
triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0)
del buf2, arg0_1, arg1_1, workspace
```
- add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code.
The generated cpp has lines like the below, so we also implement a `zero_()` for `AtenTensorHandle`.
```cpp
static constexpr int64_t int_array_0[] = {1280L, };
static constexpr int64_t int_array_1[] = {1L, };
AtenTensorHandle workspace_handle;
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda, 0, &workspace_handle));
RAIIAtenTensorHandle workspace(workspace_handle);
workspace.zero_();
```
- Fix handling of grid_fn for grid computation: pass "RBLOCK" to `split_scan_grid`.
- Fix dynamic shapes:
Without the fix we generate code that looks like this `workspace = empty_strided_cuda((32*((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune and `s0` is not defined.
The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code.
- We also generate slightly different cpp code depending on if `abi_compatible` is turned on.
```cpp
RAIIAtenTensorHandle workspace(workspace_handle);
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get()));
```
vs
```cpp
at::Tensor workspace = at::detail::empty_strided_cuda({8L*(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA);
workspace.zero_();
```
Test Plan:
```
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper
TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552
Approved by: https://github.com/desertfire
Previously we were accommodating `torch._dynamo.mark_dynamic()` for export's dynamic shapes. Here we clean things up and ignore it, requiring users to specify an export input for `dynamic_shapes`.
Note: there's 4 decorators relevant to export, `mark_dynamic, maybe_mark_dynamic, mark_static, mark_unbacked`. User calls that involve export have only been `mark_dynamic()`, and we use `maybe_mark_dynamic` under the hood for `Dim.AUTO`, but we could start using others. One reason I decided to not warn and just silently ignore is these decorators cause the tensors to carry dynamic info, and it'll be hard to tell whether the markers are from export or user calls when re-exporting with the same inputs.
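A hedged sketch of the expected usage after this change (the module and dim name below are my own):
```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

x = torch.randn(4, 8)
# Instead of torch._dynamo.mark_dynamic(x, 0), pass dynamic_shapes explicitly.
ep = export(M(), (x,), dynamic_shapes={"x": {0: Dim("batch")}})
print(ep)
```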
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135536
Approved by: https://github.com/avikchaudhuri
Optimized dynamic quantization for aarch64 was enabled by #126687 and #134897
This PR fixes an issue for aarch64 where, on a [cache miss](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp#L592) (e.g. if input dimensions change), [ideep::matmul_forward::compute](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L160) (wrongly) runs with the [default lowp_kind (u8s8)](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L174), which is not supported by oneDNN+ACL (Arm Compute Library), causing the workload to fall back to a much slower oneDNN gemm:jit kernel.
Example:
```python
import torch

DIM = 4096
INPUT_SIZE1 = 32
INPUT_SIZE2 = 16

class LinearNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(DIM, DIM, bias=False)

    def forward(self, x):
        x = self.fc1(x)
        return x

input1 = torch.randn(size=(INPUT_SIZE1, DIM))
input2 = torch.randn(size=(INPUT_SIZE2, DIM))

with torch.no_grad():
    model = LinearNet()
    model = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear})
    model(input1)  # this goes to ACL lowp_gemm
    print("=" * 50)
    model(input2)  # this goes to gemm:jit without this PR, and to ACL with this PR
```
In the code snippet above:
- The matmul from `model(input1)` goes to oneDNN+ACL (in both cases, with and without the PR)
- The matmul from `model(input2)`: **Without this PR**: there's a cache miss (different input shapes) and matmul_forward::compute is run with the default lowp_kind (u8s8). Hence the matmul falls back to gemm:jit in oneDNN. However, **With this PR** the matmul goes to oneDNN+ACL which is around 10x faster than oneDNN+jit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135058
Approved by: https://github.com/jondea, https://github.com/malfet
Summary:
Sometimes we only want to generate a replacement for a matched pattern
once we know some information about the nodes in the pattern.
So far, we have found this most useful for doing matches based on specific shapes of tensors flowing into functions.
Use a callback function similar to `match_filters`. By default this isn't used.
Had to make `replacement` a None-able parameter because Callable was
already used to detect a case where a graph needed to be traced.
Differential Revision: D62412628
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135553
Approved by: https://github.com/SherlockNoMad
This PR:
* Implements the pre-existing `nt.to_padded_tensor(padding_val)` ATen op via the FBGEMM kernel + appropriate view gymnastics (since that kernel only handles 2D values)
* Introduces a new `_nested_from_padded_tensor` op for the reverse conversion, implemented via the reverse FBGEMM kernel + view gymnastics
* Note: there is currently no public API for this; design booted to a future PR
TODO:
* ~~Propagate min / max sequence length via the new factory function `_nested_from_padded_tensor`~~
* ~~Verify that Inductor does computation fusion via test logic~~
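A hedged usage sketch of the padded conversion described above (shapes and padding value are illustrative):
```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 3), torch.randn(4, 3)], layout=torch.jagged
)
padded = nt.to_padded_tensor(0.0)  # ragged dim padded out with 0.0
print(padded.shape)                # torch.Size([2, 4, 3])
```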
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125947
Approved by: https://github.com/soulitzer
This PR resolves #134408. It adds an additional test, which passes locally.
Do you think we should add a post-check to ensure `args` and `kwargs` are not both `None`? It seems to be possible to have modules without inputs.
This PR does not include any such post-check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134643
Approved by: https://github.com/zou3519
**Summary**
1. This PR removes the public API `compute_local_shape` and replace its use with the more general API `compute_local_shape_and_global_offset`.
2. To keep `compute_local_shape_and_global_offset` consistent with `compute_local_shape` on empty shards, it now returns local tensor shape `(0,)` for empty shards which is more aligned with DTensor's semantics on non-participating ranks.
**Test**
`pytest test/distributed/_tensor/test_dtensor.py`
`pytest test/distributed/_tensor/test_init.py`
`pytest test/distributed/_tensor/test_tensor_ops.py`
Differential Revision: [D62415591](https://our.internmc.facebook.com/intern/diff/D62415591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135554
Approved by: https://github.com/tianyu-l, https://github.com/wz337
When the input format for group norm is NHWC and the device is privateuseone, it introduces an additional transpose operation. To avoid this issue, a check for the privateuseone device needs to be added here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135575
Approved by: https://github.com/ezyang
Summary:
Skip test_prepare_qat_conv_bn_fusion_getitem_placeholder when we use training ir, since it's only for bn-getitem pattern, but the pattern doesn't exist in training ir.
Remove BLOCK_LIST since it's empty.
Now all internal unittests will use training ir.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' caffe2/test/quantization:test_quantization -- -r test_prepare_qat_conv_bn_fusion_getitem_placeholder
buck2 run 'fbcode//mode/dev-nosan' caffe2/test:quantization_pt2e_qat -- -r test_prepare_qat_conv_bn_fusion_getitem_placeholder
```
Differential Revision: D62387987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135729
Approved by: https://github.com/tugsbayasgalan
Summary:
We observed another long-computation issue for the OBA_AFOC pyper model, thus adding a pattern to avoid the perf regression.
- Only happens on A100.
- We do not want to use force_shape_pad since it would pad all GEMMs, which may not be optimal. The Optimus pass has more flexibility to target specific GEMM shapes and do the corresponding padding.
- To enable, we pass the pass to the config, where "k_threshold_to_pad" can be customized:
`inductor_config.patch(post_grad_fusion_options={"pad_aten_mm_pass": {"k_threshold_to_pad": 8388608}})`
Test Plan:
# unit test
```
buck2 test mode/opt //caffe2/test/inductor:pad_mm
```
Buck UI: https://www.internalfb.com/buck2/58b0f272-f405-45be-bc8d-aec2dc4d5841
Test UI: https://www.internalfb.com/intern/testinfra/testrun/10133099209954651
Network: Up: 9.0KiB Down: 142B (reSessionID-8eb71a37-a5ca-4aff-a4f1-93ade3e47e4e)
Jobs completed: 9. Time elapsed: 3:18.0s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3)
Tests finished: Pass 17. Fail 0. Fatal 0. Skip 0. Build failure 0
# e2e test
see [D62388582](https://www.internalfb.com/diff/D62388582)
Differential Revision: D62220158
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135167
Approved by: https://github.com/jackiexu1992
When cpu offloading is enabled, if the user loads a gpu state dict, FSDP2 will throw a less obvious error at backward:
```
RuntimeError: attempting to assign a gradient with device type 'cpu' to a tensor with device type 'cuda'. Please ensure that the gradient and the tensor are on the same device
```
This PR throws the error more explicitly, specifying which parameters should be moved to cpu because of cpu offloading:
```
FSDP parameters should be materialized on cpu when enabling cpu offloading. For example, load cpu state dict or call module.to_empty(device="cpu"). Found following parameters on non-cpu device: ['0.weight']
```
`pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_cpu_offload`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135156
Approved by: https://github.com/awgu
Using `fsdp.set_` for the unsharded_param inplace update causes difficult-to-debug errors when enabling Traceable FSDP2 on TorchTune models. In this PR, we change it to use `fsdp.copy_`, which fixes the error and also strictly follows eager semantics (i.e. if the user explicitly stores an alias of the unsharded_param during execution of the user's module code, that alias will get updated correctly when the unsharded_param is copy_ into; whereas if we just swap out the unsharded_param storage via set_, that user-saved alias will not get updated, which is not good).
This PR also implements the graph pass to remove the resizes and copy if there is a resize_(full) -> copy_ -> resize_(0) pattern.
------
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_copy_`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_partitioner_cse_respects_mutation_boundaries`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_fsdp_set_input_mutation_applied_when_input_gets_no_gradients`
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mutation_op_matching`
- `python test/inductor/test_distributed_patterns.py DistributedPatternTests.test_fake_distributed_aot_eager`
- `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py TestEagerFusionOpInfoCPU.test_aot_autograd_exhaustive_norm_cpu_float32`
- `python test/distributed/test_inductor_collectives.py TestCollectivesInductor.test_backwards`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133730
Approved by: https://github.com/bdhirsh
This PR is a replacement for https://github.com/pytorch/pytorch/pull/133085 for pushing a quick fix for RMSNorm.
The original author is @kkontny
Previous PR summary:
Since FP16 has quite a small dynamic range, it is very easy to overflow while computing `at::pow(input, 2)`, and this happens in real-world computation.
I've tried to use the `nn.RMSNorm` fused implementation instead of `LlamaRMSNorm` inside the `transformers` implementation of Llama (`src/transformers/models/llama/modeling_llama.py`). It started to give wrong answers in FP16 while still giving good results in FP32. I figured out this happens due to overflow while computing the square of the input tensor.
The original `LlamaRMSNorm` implementation upcasts the input to FP32 to prevent this and give better numerical stability.
```
import torch
from torch import nn

class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        LlamaRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)  # upcast to avoid fp16 overflow in pow(2)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
```
The proposed commit fixes the issue. FP16 in RMSNorm has to be treated in a special way to be usable in real-world implementations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134106
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy
Notable changes:
1. Enable CudaGraph related tests
2. Fix UT problems
3. EXPERIMENTAL Navi31 support. Users should enable Navi31 support with the env var `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`
Known problems:
1. `test/test_transformers.py` shows massive failures and/or NaN outputs with `--use-pytest`
+ Update: Confirmed that skipping `class TestSDPAPrivateUse1Only` fixes the problem with `--use-pytest`
Note:
AOTriton 0.7b adds support for nested tensors + SDPA but needs more work (and consequently a separate PR) to enable it.
Fixes #133540
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134498
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet
Summary: Update the SDPA decomposition to match the updated strides from D62009189, which aligns strides with `aten._scaled_dot_product_attention_math.default` and makes `t.permute().contiguous().permute()` no longer necessary.
Test Plan: CI
Differential Revision: D62278378
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135297
Approved by: https://github.com/drisspg
Summary: as title
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_conv_dynamic
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r matcher
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r x86
```
CI
Differential Revision: D62448302
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135623
Approved by: https://github.com/tugsbayasgalan
* Add pytorchbot to list of approvers for file
* Add labels to the auto created PR
The auto-generated PR is currently not merging due to some failing tests on the slow workflow that were supposed to be moved back to normal.
idk if this has much value, clearly we've been managing without the update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135390
Approved by: https://github.com/ZainRizvi
Summary: For S444023
Test Plan:
Revert prevented the NaN errors - f639391901
Training job ran for 7767 iterations. NaN errors show up within the first 1k.
Reviewed By: nmacchioni
Differential Revision: D62224747
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135186
Approved by: https://github.com/kit1980
Summary:
These are still utilized when using relu/sigmoid/tanh tensors directly from here: https://fburl.com/code/k6n7ofzd
However, on Mac Catalyst we were always returning `nil`, which in most cases rendered the entire graph completely useless, most often leaving just stray `MPSTemporaryImage` references that were never written into.
This fixes the issue completely by making sure that we always return the valid kernels back, so they can be executed.
Test Plan: Test with segmentation net that uses a combination of relu and other tensors together - run this via Mac Catalyst build - it works! {F1858576745}
Reviewed By: MichaelTay
Differential Revision: D62430010
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135595
Approved by: https://github.com/MichaelTay
Some customers would like to run the NaN checks on the fly, so we are improving its efficiency.
## Benchmarking
Allreduce 2G floats. `TORCH_NCCL_NAN_CHECK=1`
Red kernel: ncclAllreduce
Blue kernel: Nan check
<img width="1093" alt="Screenshot 2024-09-06 at 10 00 05 PM" src="https://github.com/user-attachments/assets/5501bc31-024f-4115-adb2-dd66eb4025d3">
## Comparison with torch ops:
Let's say a user manually checks for NaNs with the following torch ops before all-reduce:
```
torch.any(torch.isnan(x))
```
<img width="1091" alt="Screenshot 2024-09-06 at 10 14 53 PM" src="https://github.com/user-attachments/assets/1f8b5f63-c955-4612-bb96-241b6c69959b">
So our perf is on-par with torch ops.
## Changes
- Load from vidmem using "big packs" of 16 bytes
- Bump `blockDim.x` from 256 to 512
- Separate loads and checks into two loops, each of 8 iterations
- Unroll the loops
- Templated functions for checking NaN in a "big pack" based on dtype
Special thanks to @jbachan from NCCL!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135414
Approved by: https://github.com/wconstab
While designing something else that needs TCPStore, I spent some time digging into the codebase of TCPStore and found that the code is a little bit challenging to understand without proper documentation. Although people from the OSS community must be smarter than me, I still want to document my findings in the code so that devs and users can use them as a reference down the road.
Also, for libuv, we need to mark private variables with a "_" prefix, so this is a pure renaming of private variables such as `tcpServer`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130496
Approved by: https://github.com/wconstab
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66))
This PR adds support for calling the setattr of thread-local objects. These objects have a slots-based setattr impl which doesn't appear to have any side effects, so we call that impl when replaying mutations, because calling `object.__setattr__` on these objects results in a type error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
When FileCheck is destructed without execution, it should output all rules.
For example:
```
>>> fc = FileCheck().check("test")
>>> del fc
You have not run this instance of FileCheck!
FileCheck checks:
CHECK: test
```
Additionally, unit tests for the Python interface of FileCheck will be added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135345
Approved by: https://github.com/eellison
Fixes#127519
Currently in torchrun rendezvous, there are only two rendezvous backends supported out of the box: `C10d` and `Etcd`. The changes in this PR enable distributed elastic users to bring their out-of-tree rendezvous backend implementations as Python packages.
#### AUTHORING NEW PLUGIN
Any new plugin will be a python package exposing entry-points. For example, the structure of redis plugin is as follows:
```
plugin_root
|_ pyproject.toml
|_ src
|_ redis
|_ __init__.py
|_ redis_store.py
|_ redis_backend.py
```
The contents of the `pyproject.toml` should indicate that it exposes a torchrun entry point by mentioning the group name `torchrun.plugins`. The `pyproject.toml` for the redis plugin would be as follows:
```
[project]
name = "redis"
version = "0.0.1"
[project.entry-points.'torchrun.plugins']
redis = 'redis'
```
The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows:
```
from torch.distributed.elastic.rendezvous import RendezvousParameters
from torch.distributed.elastic.rendezvous.dynamic_rendezvous import create_handler

def getPluginHandler():
    def _create_redis_handler(params: RendezvousParameters):
        from redis_rendezvous_backend import create_backend

        backend, store = create_backend(params)
        return create_handler(store, backend, params)

    return _create_redis_handler
```
The files `redis_store` and `redis_backend` contain the implementation of [Store](41189b0da4/torch/_C/_distributed_c10d.pyi (L171)) and [RendezvousBackend](e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)) respectively.
#### USER EXPERIENCE
Before using the plugin for the first time, the user has to install the plugin packages. For example, published packages can be installed using `pip3 install <plugin-name>`, and a plugin in the local file system can be installed using `pip3 install -e <plugin-location>`.
Once installed, the new backend can be used in torchrun as follows:
```
torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633
Approved by: https://github.com/fduwjj
* Added a cpp loader, AOTIModelPackageLoader, which can load the .pt2, build the .so, and create a runner. The python-facing API is that users can directly call the `run` function, whereas in cpp users can directly access the `runner_` if they are more familiar with that. I couldn't figure out how to bind the `get_runner()` function to python...
* Added a new config, `aot_inductor.package_cpp_only` which will **not** package the so. This means that whenever the package is loaded, we will need to build the so. This is turned off by default so that new environments do not need to rebuild their so. The `package_cpp_only` is a feature which torchchat intends to use to provide flexibility to users.
* Added a new config, `aot_inductor.metadata` which stores user-provided metadata, serialized to the pt2 as a json file. It also stores the device used when exporting, "cuda" or "cpu", so that during load time, we can use that data to determine which AOTIModelContainerRunner to use. The metadata can be accessed through `loader.get_metadata()`. TODO is to move this metadata to the toplevel `package_aoti` function so that we can remove the metadata as a config.
* Separated out `package_aoti` as a standalone function, instead of it automatically being called in inductor. This is to prepare for the case where users will compile multiple models, and want to bundle it in one package. The specific use case is in torchchat, where we want to package the separately-exported encoder and decoder layers. An example of how to use this is in `test_multiple_methods`.
* `load_package` will load a singular model, given the model name.
* The loader doesn't support windows for now, I think I need to add some more casing to make the build commands work on windows?
Differential Revision: [D62329906](https://our.internmc.facebook.com/intern/diff/D62329906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135374
Approved by: https://github.com/desertfire, https://github.com/malfet
Previously, Inductor was allowed to modify the stride/storage_offset
(layout) for inputs to user-defined triton kernels. This can cause
silent incorrectness because most triton kernels are written for a
specific striding pattern (usually contiguous).
This PR adds a config to allow the user to choose Inductor's behavior on
this. The options are:
- "flexible_layout" (default): Inductor can modify the layout for inputs
to user-defined triton kernels as much as it wants.
- "needs_fixed_stride_order": Inductor must preserve the stride order
(when compared to tracing) for inputs to user-defined triton kernels.
This matches our handling for custom operators. In the future, we'll
want a "needs_exact_strides" option (this is the safest option).
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135530
Approved by: https://github.com/FindHao, https://github.com/oulgen
Fixes#132964
This change optimizes torch.sum() performance by increasing max_values_per_thread in setReduceConfig() for the ROCm platform.
Increasing this parameter uses fewer thread blocks and improves performance.
Test:
Tested on MI300x and H100, and now the MI300x perf improved to 3205GByte/s from ~1690GByte/s for the test case and is slightly better than H100 (3136GByte/s).
Also tested with other tensor sizes and saw perf improvements as well.
```python
import torch
from triton.testing import do_bench
x = torch.randn(2**30, device='cuda')
ms = do_bench(lambda: x.sum(dim=-1))
bandwidth_gbyte = x.numel() * x.dtype.itemsize / (10**9)
time_s = ms / 1000
bw_per_second = bandwidth_gbyte / time_s
print(bw_per_second)
```
Co-author: @carlobertolli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135397
Approved by: https://github.com/eqy, https://github.com/malfet
When a kernel does not have mutated args (this is quite common?), benchmarking the cost of cloning actually benchmarks a no-op. This still takes >100ms since triton.testing.do_bench will allocate a 100 ms budget to run the kernel.
Skipping this benchmarking can save quite some compilation time if the code path is hit multiple times. Let's say, if the code path is hit 100 times when the graph is large, we would save >10s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135533
Approved by: https://github.com/jansel
ghstack dependencies: #135531
Fix https://github.com/pytorch/pytorch/issues/134768 .
When we benchmark the latency for a fused node set, we do benchmarking twice:
1. benchmark the latency of the kernel including cloning mutated args
2. benchmark the latency of cloning mutated args without running the kernel
We subtract result 2 from result 1 to get the latency of the kernel itself.
But when the tensors are not on cuda device 0, we get equal numbers for result 1 and result 2 no matter how much work the kernel does. The root cause is that in `triton.testing.do_bench`, the `torch.cuda.synchronize` call syncs the current cuda device (which is device 0 if it's not overridden). But since the tensors and kernels are located on another device, the sync actually does nothing (unless there happen to be other kernels on device 0).
The fix is to set the correct current device in our benchmarking code.
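A minimal sketch of the fix (assuming a CUDA tensor `x` that may live on a non-default device):
```python
import torch
from triton.testing import do_bench

def bench_on_tensor_device(fn, x: torch.Tensor) -> float:
    # Make x's device current so that do_bench's torch.cuda.synchronize()
    # waits on the device that actually runs the kernel.
    with torch.cuda.device(x.device):
        return do_bench(fn)

# e.g. bench_on_tensor_device(lambda: x.sum(), x) with x on cuda:1
```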
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135531
Approved by: https://github.com/jansel
This PR adds a private API `_set_unshard_async_op` that allows for running pre-forward and pre-backward all-gathers using the `async_op=True` path so that all-gather allocations happen in the default stream to avoid inter-stream fragmentation.
If using this option, forward requires explicit prefetching e.g. via the `unshard(async_op=True)` API for overlap. fp32 -> bf16 casts and the all-gather copy-in will not overlap with compute.
Differential Revision: [D62401551](https://our.internmc.facebook.com/intern/diff/D62401551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135523
Approved by: https://github.com/weifengpy
Summary:
In S445839, we had HTA break because of the "stream" parameter that was added to gpu traces. This brought up discussions regarding hardening our post-processing of said inputs so as not to break the JSON schema or downstream tools. For this reason, this diff does the following:
1. Only allow int, double, bool and string values to be processed as kwinputs for JSON output. We can handle lists if needed in the future.
2. Make sure that any boolean is lowercased when converted to a string so that the JSON does not break when parsing it
3. Force stream parameter to be an int
Test Plan: Added unit tests to ensure that the list of requirements above is true for kwargs only.
Differential Revision: D62304843
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135365
Approved by: https://github.com/aaronenyeshi
Summary: This test is flaky when run after `test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_cuda_wrapper` because the TestCase sets config options globally in its setUp() that stick around for subsequent tests. For test isolation, we use a contextlib.ExitStack pattern in other tests to patch the config options and restore them in tearDown(). Update all TestCases in `test/inductor/test_combo_kernels.py` to use that pattern.
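For reference, a minimal sketch of that pattern (the specific config keys shown are placeholders):
```python
import contextlib
import torch._inductor.config as inductor_config
from torch.testing._internal.common_utils import TestCase

class MyComboKernelTest(TestCase):
    def setUp(self):
        super().setUp()
        self._stack = contextlib.ExitStack()
        # Patch config only for the duration of each test.
        self._stack.enter_context(
            inductor_config.patch({"combo_kernels": True, "benchmark_combo_kernel": True})
        )

    def tearDown(self):
        self._stack.close()
        super().tearDown()
```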
Test Plan:
```
python test/inductor/test_combo_kernels.py
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_cuda_wrapper TestCudaWrapper.test_randint_cuda_cuda_wrapper
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135370
Approved by: https://github.com/jansel
## Description
Fixes the FP32 accuracy failure of `resmlp_12_224` and BF16 accuracy failure of `volo_d1_224` in timm.
In this PR, we check whether the input is contiguous in the following way:
If it has a `FixedLayout`, we know the accurate strides. For a `FlexibleLayout`, if its data is a `ComputedBuffer`, we can get the fill order of the buffer to decide whether it's contiguous. For the other cases, we won't use the GEMM template since we can't infer whether it's contiguous.
## Additional context
The current GEMM template only supports this case: `input.get_stride()[-1] == 1`. In `resmlp_12_224`, when we run into this check, the layout of `input` is a `FlexibleLayout`. The reason is that when realizing the input which is a `View` IR, the `convert_to_reinterpret_view` call fails:
d14fe3ffed/torch/_inductor/ir.py (L4712-L4715)
And it finally runs into this `copy_input` and returns a `FlexibleLayout`.
d14fe3ffed/torch/_inductor/ir.py (L4722)
When checking its stride, this `FlexibleLayout` indeed satisfies `input.get_stride()[-1] == 1` but it is later decided as a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`, which is not supported by the GEMM template, thus causing accuracy issue in this model.
The `FlexibleLayout` is converted to `FixedLayout` during [CppPackedGemmTemplate.add_choices](d14fe3ffed/torch/_inductor/mkldnn_lowerings.py (L1051)) which calls [slice_nd](d14fe3ffed/torch/_inductor/codegen/cpp_template_kernel.py (L150)) when rendering the kernel (`slice_nd(X)`). When creating the `SliceView` IR, [as_storage_and_layout](d14fe3ffed/torch/_inductor/ir.py (L2288)) invokes
[decide_layout](d14fe3ffed/torch/_inductor/ir.py (L2135)) and converts it to a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134982
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
This PR:
* Implements the pre-existing `nt.to_padded_tensor(padding_val)` ATen op via the FBGEMM kernel + appropriate view gymnastics (since that kernel only handles 2D values); see the usage sketch after this list
* Introduces a new `_nested_from_padded_tensor` op for the reverse conversion, implemented via the reverse FBGEMM kernel + view gymnastics
* Note: there is currently no public API for this; design booted to a future PR
TODO:
* ~~Propagate min / max sequence length via the new factory function `_nested_from_padded_tensor`~~
* ~~Verify that Inductor does computation fusion via test logic~~
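A minimal usage sketch of the padded conversion (using the public `torch.nested` jagged layout; shapes are illustrative):
```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 8), torch.randn(3, 8)], layout=torch.jagged
)
padded = torch.nested.to_padded_tensor(nt, padding=0.0)  # dense tensor of shape (2, 3, 8)
```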
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125947
Approved by: https://github.com/soulitzer
Fixes#135432
In the current implementation, if we try to store a symbolic number in a Tensor's constructor, it assumes that the tensor's dtype and the symbolic number's type are matched, which is not necessarily the case.
In other words, if we try to store a `SymInt`, the current implementation assumes the tensor's dtype is `torch.int32`, `torch.int64`, or similar. And if we try to store a `SymFloat`, it assumes the tensor's dtype is `torch.float32` or `torch.float64`. However, the tensor's dtype could also be `torch.float32` or something else when we try to store a `SymInt`, which would be wrong.
This PR stores symbolic numbers according to the tensor's scalar type by wrapping the guarded number of `SymInt` and `SymFloat` into a PyObject.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135433
Approved by: https://github.com/ezyang
Fixes two things:
- For regular PyTorch ops, the default layout constraint tag is always
flexible_layout. This was a bug with #135238
- Mark the new quantized _wrapped_linear_prepack ops as flexible_layout.
The metas for these are incorrect, I didn't want to fix them (and
changing the default requires the metas actually be correct).
Test Plan:
- The next PR up in the stack. The PRs are split because the next one is
riskier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135391
Approved by: https://github.com/albanD
This PR extends the current ring attention to support load-balancing shards -- the context/sequence is divided into `2 * world_size` shards and each rank gets `rank` and `(world_size * 2 - rank - 1)` shards. The data re-shuffling is done in the `context_parallel` API.
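A small sketch of the shard assignment described above:
```python
def assigned_shards(rank: int, world_size: int) -> list[int]:
    # The sequence is split into 2 * world_size shards; each rank takes one shard
    # from the front half and its mirror from the back half so that causal-attention
    # work is balanced across ranks.
    return [rank, 2 * world_size - 1 - rank]

# world_size=4 -> rank 0: [0, 7], rank 1: [1, 6], rank 2: [2, 5], rank 3: [3, 4]
```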
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132442
Approved by: https://github.com/wconstab
Fix#134686.
PR https://github.com/pytorch/pytorch/pull/132729 makes GEMM template faster for one of the GEMMs in xcit_large_24_p8_224:
```
SingleProcess AUTOTUNE benchmarking takes 1.7088 seconds and 1.9207 seconds precompiling
AUTOTUNE linear_unary(12544x3072, 768x3072, 768)
  cpp_packed_gemm_2 2.9371 ms 100.0%
  _linear_pointwise 3.1584 ms 93.0%
```
But it is slower than ATen in the e2e run due to different cache behavior. The access to the input data (12544x3072) is LLC latency bound, and bottlenecks are seen due to memory synchronization (data transfers and coherence updates across processors). This PR tries to mitigate the problem by cooperatively loading different chunks of the input data from the different processors that share it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135438
Approved by: https://github.com/leslie-fang-intel
Fix example: Address broadcasting error in the addition of `attn_bias` and `attn_mask`, and correct device assignment for newly created variables in the method.
1. Adding `attn_bias += attn_mask` results in a broadcasting error. The expected shape of `attn_bias` is (L, S), so the output should also have the shape (L, S). However, when the input shape is (N, num_heads, L, S), broadcasting occurs, leading to an output shape of (N, num_heads, L, S), which is not desired.
2. `attn_bias` is a newly created variable within the method, but it is not assigned to the correct device.
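For reference, a condensed sketch of the corrected reference math (dropout and GQA handling omitted; it mirrors the documented example for `scaled_dot_product_attention`, with the two fixes marked):
```python
import math
import torch

def sdpa_reference(query, key, value, attn_mask=None, is_causal=False, scale=None):
    L, S = query.size(-2), key.size(-2)
    scale_factor = 1 / math.sqrt(query.size(-1)) if scale is None else scale
    # Fix 2: create attn_bias on the query's device instead of the default device.
    attn_bias = torch.zeros(L, S, dtype=query.dtype, device=query.device)
    if is_causal:
        assert attn_mask is None
        temp_mask = torch.ones(L, S, dtype=torch.bool, device=query.device).tril(diagonal=0)
        attn_bias.masked_fill_(temp_mask.logical_not(), float("-inf"))
    if attn_mask is not None:
        if attn_mask.dtype == torch.bool:
            attn_bias.masked_fill_(attn_mask.logical_not(), float("-inf"))
        else:
            # Fix 1: out-of-place add, so a (N, num_heads, L, S) mask broadcasts into the
            # attention weights instead of trying to broadcast into the (L, S) attn_bias in place.
            attn_bias = attn_mask + attn_bias
    attn_weight = query @ key.transpose(-2, -1) * scale_factor
    attn_weight += attn_bias
    attn_weight = torch.softmax(attn_weight, dim=-1)
    return attn_weight @ value
```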
**This is my retry of PR #130209 . The PR has been merged into commit `d4a79d4a7c746068d25fe5cf9333495561f4ce1f`, but the modifications were overwritten by subsequent commits.**
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
@mikaylagawarecki provided a more elegant implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135427
Approved by: https://github.com/ezyang
reland of https://github.com/pytorch/pytorch/pull/133113
I have to create a new PR because the previous reverted PR could not be rebased or imported successfully :(
----
Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes:
* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next PRs)
* To preserve the BC for users still using the torch.distributed._tensor, I added a shim script to redirect old path calls to the new module
The BC preservation is evidenced by the fact that all DTensor tests still work without changing the public imports, so it's safe to land the changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
Use oneDNN BRGEMM on packed data to get better performance on the 5th generation of Xeon where Intel® Advanced Matrix Extensions (AMX) will have fp16 support, e.g. amx-fp16.
Multiple models have achieved acceleration, for instance, FP16 stable diffusion v2.1 has achieved over 50% improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131879
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #131878
Summary: BrokenProcessPool means a parallel-compile subprocess exited, which we never expect. It's likely due to a crash, so print a more meaningful error message and instructions that it's probably easier to debug by turning off parallel compile. Output looks like:
```
...
File "/data/users/slarsen/pytorch/torch/_inductor/runtime/compile_tasks.py", line 45, in _reload_python_module
exec(code, mod.__dict__, mod.__dict__)
File "/tmp/torchinductor_slarsen/4q/c4qw7xk5lbb7whg5txnk4hwbc7z6kepak3o666tr3d64gcad5r5b.py", line 815, in <module>
async_compile.wait(globals())
File "/data/users/slarsen/pytorch/torch/_inductor/async_compile.py", line 265, in wait
raise RuntimeError(
RuntimeError: A compilation subprocess exited unexpectedly. This is likely due to a crash. To facilitate debugging, you can re-run with TORCHINDUCTOR_COMPILE_THREADS=1 to cause compilation to occur in the main process.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135120
Approved by: https://github.com/Chillee
Sync with https://github.com/justinchuby/torch-onnx/compare/v0.1.20...v0.1.21 to support FakeTensors in ONNXProgram. Specifically, this PR implements the `apply_weights` method to allow users to supply a dictionary of concrete tensors to replace FakeTensors in the exported model weights.
An error is raised when users try to serialize a FakeTensor to avoid segfaults.
Also fixed a bug in `.save()` when `keep_initializers_as_inputs` is True and `include_initializers` is False.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135399
Approved by: https://github.com/titaiwangms
Previously, when an input contains a mixture of `Value` and python constants like `[SymbolicTensor('sym_size_int_3', type=Tensor(INT64), shape=[], producer=node_Shape_0, index=0), 512]`, we get errors like
```pytb
Traceback (most recent call last):
File "/Users/justinc/Documents/GitHub/torch-onnx/src/torch_onnx/_building.py", line 367, in _call_op
converted_named_inputs = _process_python_constants_and_sequences(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/justinc/Documents/GitHub/torch-onnx/src/torch_onnx/_building.py", line 275, in _process_python_constants_and_sequences
raise TypeError(
TypeError: Constant input '[SymbolicTensor('sym_size_int_3', type=Tensor(INT64), shape=[], producer=node_Shape_0, index=0), 512]' of type '<class 'list'>' is not supported
```
This PR updates Sequence handling to support this case, as well as variadic inputs and ONNX Sequence inputs.
Synced from https://github.com/justinchuby/torch-onnx/pull/187
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135378
Approved by: https://github.com/titaiwangms
This is the OSS component of a larger MTIA diff.
Currently, Inductor disables padding for non-GPU devices. We need to change this behavior to enable padding on MTIA.
This PR adds a config option to enable padding on the CPU, or any other non-GPU device. In the future, we might want to enable padding on all devices by default. However, that might require supporting device-dependent padding defaults, since CPUs will likely use different settings than H100 GPUs.
Differential Revision: D61038114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135280
Approved by: https://github.com/jfix71, https://github.com/shunting314
Refactor exporter errors to combine old errors and new errors for API consistency.
This PR also
1. Removes the `_C._check_onnx_proto(proto)` call in the old exporter. We don't need the ONNX checker because it is limited.
2. Removes the `OnnxExporterError` defined in the dynamo module. This class unnecessarily stores the onnx program object, making it very bulky. Instead, we revert to use the plain OnnxExporterError defined in the `errors` module and use it as the base class for all errors.
3. Continues to expose `OnnxExporterError` in `torch.onnx` and the rest of the errors in `torch.onnx.errors`.
4. Removes the `CheckerError` and `InvalidExportOptionsError` from `torch.onnx`. This is BC breaking but should have low impact.
5. I did not rename existing errors out of compatibility considerations, even though `ExporterError` would have been more succinct.
Fixes https://github.com/pytorch/pytorch/issues/135125
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135180
Approved by: https://github.com/titaiwangms
Before the fix, the unit test will fail at forward Dynamo tracing:
```
File "/data/users/willfeng/pytorch/test/distributed/_composable/test_replicate_with_compiler.py", line 415, in test_ddp_tp
loss = compiled_replicate_model(data).sum()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
torch._dynamo.exc.InternalTorchDynamoError: SymNodeVariable() is not a constant
from user code:
File "/data/users/willfeng/pytorch/torch/distributed/tensor/parallel/_data_parallel_utils.py", line 34, in _unflatten_tensor
result = DTensor.from_local(
```
After the fix, the compilation fails at a later step (Compiled Autograd tracing), due to needing "pre-dispatch tracing of backward graph" feature (see details at https://github.com/pytorch/pytorch/issues/127797#issuecomment-2291695474).
I believe this PR is a net improvement, because it should also fix the 1D Traceable FSDP2 failure case on internal models (https://github.com/pytorch/pytorch/issues/130978#issuecomment-2319476690), which is much harder to build a minimal unit test for.
Fixes https://github.com/pytorch/pytorch/issues/130978.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135315
Approved by: https://github.com/bdhirsh
Summary: In new export_for_training, "stack_trace" does not exist in node meta anymore.
Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test:quantization_pt2e -- -r test_constant_prop_preserve_metadata
```
Reviewed By: angelayi
Differential Revision: D62219974
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135268
Approved by: https://github.com/angelayi
The current test is failing because of the current unstable state of map: torch.compile and non-strict export are taking two separate routes, unlike cond and while_loop. This PR fixes the test itself; we'll fix map in follow-up PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135366
Approved by: https://github.com/angelayi
This replaces the existing TCPStore counters with the new shared wait counters. There are no users of the TCPStore counters, so this should be completely safe to remove.
Test plan:
Existing tests + build
There's no OSS backend for wait counters, so we can't write any tests with them currently.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135283
Approved by: https://github.com/c-p-i-o
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.
I tested that my change works when calling unfold and backward on a `MaskedTensor`, but I didn't find the tests for the dispatch of such operations. Where are they?
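In any case, a minimal usage sketch of the new dispatch (the MaskedTensor API is a prototype; values are arbitrary):
```python
import torch
from torch.masked import masked_tensor

data = torch.arange(6.0)
mask = torch.tensor([True, False, True, True, False, True])
mt = masked_tensor(data, mask)
windows = mt.unfold(0, 2, 2)  # masked windows of size 2 with step 2
print(windows)
```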
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
Summary:
There was a regression introduced in https://github.com/pytorch/pytorch/pull/125743 that made `local_addr` no longer used. This fixes that by passing `local_addr` to `RendezvousStoreInfo.build` everywhere it's used.
This also fixes a number of tests, allowing them to be run in parallel, which hugely sped up the testing cycle since this change touches many different rendezvous implementations. This required a few fixes in unrelated tests.
Test Plan:
Added tests for the common rendezvous implementations checking that `local_addr` is passed through, to prevent future regressions.
```
buck2 test @//mode/dev-nosan fbcode//caffe2/test/distributed/elastic/... fbcode//caffe2/torch/distributed/elastic/... -- --stress-runs 3
```
To vet the parallelism changes I also ran with 3 stress runs each to identify flakiness caused by parallelism.
Differential Revision: D62256407
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135262
Approved by: https://github.com/fduwjj, https://github.com/wz337
Summary: In general I think it will be useful to also record the global torch version in the EP, so that we can track them in the logging in addition to the schema version.
Test Plan: CI
Reviewed By: henryoier
Differential Revision: D62252626
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135243
Approved by: https://github.com/yushangdi
Summary:
resnet152 spent about 15 minutes writing warning messages in _unlift
during `to_executorch` because they're all written to unbuffered stderr
by the `warnings` module.
These warnings are almost always about get_attr nodes referencing a
non-existent name:
```lang=py
warnings.warn(f'Node {node} target {node.target} {atom} of {seen_qualname} does '
'not reference an nn.Module, nn.Parameter, or buffer, which is '
'what \'get_attr\' Nodes typically target'
)
```
I'm not aware of a way to configure the warnings module to write this out
at most once, so I'm just going to disable the lint for now.
Test Plan:
Re-ran resnet152 with Executorch and the XNNPackBackend, it is much faster now
Differential Revision: D62156090
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135069
Approved by: https://github.com/yushangdi
By default, Inductor is allowed to manipulate the layout
(strides+storage offset) of input tensors to custom operators.
We want to change it so that the default is that Inductor should respect
the stride order of input tensors to custom operators.
This PR adds a config to toggle the behavior, in the next PR up we'll
change the default. We also make the following changes:
- We add a new operator Tag (flexible_layout), which means that
inductor is allowed to manipulate the layout. When we flip the default,
users can specify they want the old behavior by using this tag.
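A hedged sketch of opting an operator into the old behavior via the new tag once the default flips (the `mylib::my_kernel` names are hypothetical):
```python
import torch
from torch.library import Library

lib = Library("mylib", "FRAGMENT")
# Tag the op so Inductor may keep manipulating its input layouts.
lib.define("my_kernel(Tensor x) -> Tensor", tags=(torch.Tag.flexible_layout,))

def my_kernel_impl(x):
    return x.clone()

lib.impl("my_kernel", my_kernel_impl, "CompositeExplicitAutograd")
```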
This is a reland of https://github.com/pytorch/pytorch/pull/126986,
which was previously reverted due to silent incorrectness. We've since
fixed the silent incorrectness
(https://github.com/pytorch/pytorch/pull/133639)
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135238
Approved by: https://github.com/albanD
Summary:
In multiprocessing, signal handling is not possible if the thread is not the main thread. This resulted in the following error:
> "ValueError('signal only works in main thread of the main interpreter')"
To address this issue, the diff checks whether the thread is the main thread and, if not, skips signal handling.
Test Plan:
Before this change, MAST job failed:
https://fburl.com/mlhub/iq2m10v8
With this change, MAST job succeeded:
https://fburl.com/mlhub/q6kb8343
Differential Revision: D62166943
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135088
Approved by: https://github.com/d4l3k
SplitScan makes use of a workspace arg that needs to be zeroed before it is used; it is then used to communicate between thread blocks during the triton kernel implementation. It is mutated during the execution of the kernel, so it should be marked as such.
Before this PR, it is not marked as mutated; AFAIK this is fine during normal execution, but during autotuning it causes problems. The workspace starts off zeroed (as expected), but during autotuning the kernel will be executed multiple times and the workspace does not get re-set between executions, resulting in incorrect data. If the data is used for indexing, then you can fail device-side asserts (and the results after the initial run (with autotuning) could be wrong). The test added in this PR repros the issue when the fix is removed.
When we mark the arg as mutated, then the arg gets cloned before autotuning, so that the arg passed to the kernel during autotuning will always be zeroed as expected.
804852c1f9/torch/_inductor/runtime/triton_heuristics.py (L685-L689)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134648
Approved by: https://github.com/peterbell10, https://github.com/jansel
This PR is slightly a revival / update to the discussion from https://github.com/pytorch/pytorch/pull/98960:
Part of FSDP2's tracing strategy right now is that:
(1) it is painful/difficult to handle the case where we have multiple graph input tensors that are aliased to each other and at least one of them is duplicated
(2) we already have longstanding logic to remove duplicate input tensors from the graph in dynamo. Morally, FSDP2 gives us duplicate input tensors in the backward graph for every `unsharded_param`, because we have (a) the `unsharded_param` being closed over by the backward hook to resize/allgather, and (b) the same `unsharded_param` being saved for backward by autograd (we now guarantee in the partitioner that we will always save the base tensor for backward and recompute views)
(3) However, we were still seeing cases where the `unsharded_param` showed up twice in the backward graph inputs, as distinct tensor objects (with different python ids) instead of being true duplicates that dynamo can de-dup.
It turns out that this was because we were `.detach()`ing the `unsharded_param` in AOTDispatcher before plumbing it through the compiled forward (and so autograd would save a detach'd version of the `unsharded_param`). This is precisely because of the logic from https://github.com/pytorch/pytorch/pull/98960.
However, re-reading the detailed comments, it seems unnecessary to do a detach() on a graph input that is a (leaf) `nn.Parameter`, even if it happens to get no gradients in the backward. Since it is a leaf, we don't have to worry about the autograd engine "continuing to backprop through the graph beyond the current tensor" (the leaf has no other grad_fn for autograd to backprop through).
So this PR makes us a bit less aggressive about calling detach() on inputs: we only do it when:
(1) our graph input statically will get a `None` gradient (and also has no metadata mutations, the existing state)
(2) **and** our graph input is a non-leaf tensor (so detach()ing is actually required to prevent autograd from incorrectly backpropping past the non-leaf).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134193
Approved by: https://github.com/yf225
Co-authored-by: Will Feng <yf225@cornell.edu>
Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7705964779531357/
This now also incorporates a test from https://github.com/pytorch/pytorch/pull/133585 (which it fixes) and the prep PR https://github.com/pytorch/pytorch/pull/134407 Including the PR desc from that:
I am trying to fix a problem reported by user in [fb.workplace.com/groups/6829516587176185/permalink/7705964779531357](https://fb.workplace.com/groups/6829516587176185/permalink/7705964779531357/) The summary of this problem is that when we do collect metadata analysis in AOTAutograd, we accumulate pending unbacked symbols which are going to be discarded at the end of the trace. However, if we do a recursive make_fx inside tracing, as occurs with torch.cond, we end up seeing that there are pending unbacked symbols that aren't associated with a binding, even though it's spurious (they've leaked into the inner make_fx call from the outer AOTAutograd analysis).
In https://github.com/pytorch/pytorch/pull/133588 I tried to just prevent adding the symbols to the pending list at all in the first place. But this itself caused some problems which were fixed in https://github.com/pytorch/pytorch/pull/124785 . The problem fixed in that PR is that when we allocate tangents that have unbacked size, something prevented them from having correct unbacked SymInts when ignore fresh unbacked SymInts was enabled. So I had patched it at the time by just not suppressing pending symbols and clearing them out some other way.
I think... I was wrong in that PR? That is to say, it was OK to avoid putting the fresh unbacked symbols in the pending list; the real problem was suppressing unbacked renamings. But there doesn't seem to be a good reason to suppress these; this PR shows that it doesn't actually fail any tests if you do these anyway. Intuitively, this makes sense, because you can't trigger renamings unless you're actually adding unbacked symbols to the pending set.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135053
Approved by: https://github.com/ydwu4
Fixes https://github.com/pytorch/pytorch/issues/114389
Previously, dynamo would attempt to trace through the `__init__` of traceable tensor subclasses. Since their constructors are AOT dispatcher traceable by definition, dynamo should automatically put these in the graph like we do for any other tensor; tracing through the constructor instead is difficult because dynamo would need to apply mutations after tensor subclass creation in the graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135151
Approved by: https://github.com/bdhirsh
Summary:
In the graph of the TestXNNPACKQuantizer.test_dynamic_linear_with_conv test, some quantized_decomposed.quantize_per_tensor.default ops are becoming quantized_decomposed.dequantize_per_tensor.tensor ops when using the new training IR.
This is because we lift params/buffers before calling make_fx. So previously, for the graph that's passed to make_fx, `graph.L__self___linear1.weight` was a tensor;
now, in the training IR, graph.L__self___linear1.weight is a FakeTensor. This caused the node overload to be different.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_dynamic_linear_with_conv
```
Differential Revision: D61364547
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134525
Approved by: https://github.com/tugsbayasgalan, https://github.com/jerryzh168
Summary:
D62215095 introduced an import error in arvr pipelines because the is_fbcode() function does not work as intended.
This changes is_fbcode() to be a much stricter check.
Test Plan:
```
buck2 run arvr/mode/platform010/opt-stripped //arvr/libraries/depthlink/clients/mr_replay:pipeline_runner -c bolt.use_eva3_sim=True -- --config_file arvr/libraries/depthlink/clients/mr_replay/configs/runner_config.yaml --features DEPTH
```
Differential Revision: D62237502
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135244
Approved by: https://github.com/aorenste
Migrate function calls in tests to eliminate the warning message below and reduce the chance of test failures when the deprecated methods are removed (see the sketch after the warning message):
- change from the deprecated `save_state_dict` to `save`
- change from the deprecated `load_state_dict` to `load`
Warning message:
```bash
pytorch/test/distributed/checkpoint/test_fsdp_model_state.py:37: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
```
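A minimal sketch of the migration (`state_dict` contents and the checkpoint path are illustrative):
```python
import torch
import torch.distributed.checkpoint as dcp

state_dict = {"model": torch.nn.Linear(2, 2).state_dict()}

# Deprecated:
#   dcp.save_state_dict(state_dict, storage_writer=dcp.FileSystemWriter("ckpt_dir"))
#   dcp.load_state_dict(state_dict, storage_reader=dcp.FileSystemReader("ckpt_dir"))

# Preferred:
dcp.save(state_dict, checkpoint_id="ckpt_dir")
dcp.load(state_dict, checkpoint_id="ckpt_dir")
```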
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134938
Approved by: https://github.com/wz337, https://github.com/fegin
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/134998: Previously, we only checked if the `get_attr` FX node for the weight had a single user node. However, two `get_attr` nodes may share the same tensor, and the tensor should not be deleted in such cases. In this PR, we add the count of users for the tensor along with the number of users for the nodes to decide whether the tensor can be deleted or not.
**TestPlan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_linear_wgt_multi_users
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135100
Approved by: https://github.com/jgong5
The error was hard to understand without the message. Render it now. See https://github.com/pytorch/pytorch/pull/135259 for it in action.
Example failure:
```
2024-09-05T20:04:45.3022000Z FAILED [5.9524s] test_public_bindings.py::TestPublicBindings::test_modules_can_be_imported - AssertionError: String comparison failed: '' != "torch._logging.scribe failed to import w[112 chars].py)"
2024-09-05T20:04:45.3025413Z + torch._logging.scribe failed to import with error ImportError: cannot import name 'TypeAlias' from 'typing' (/opt/conda/envs/py_3.9/lib/python3.9/typing.py)
2024-09-05T20:04:45.3026990Z
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135258
Approved by: https://github.com/albanD
For aarch64 Neoverse platforms there are two GEMM backends available
for the matmul operator in PyTorch: (1) Arm Compute Library and (2) OpenBLAS.
While Arm Compute Library provides better performance than OpenBLAS,
it has kernel launch overhead, and hence we use OpenBLAS
for smaller tensor compute. The heuristic was originally implemented for
neoverse_v1. This commit extends the heuristic to other Neoverse platforms.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134548
Approved by: https://github.com/malfet
Summary: Users have recently asked that the profiler add self/total CPU and device percentages to FunctionEvents so that teams can process the data procedurally. Some of it could be done mathematically via subroutines, but since we already have the information in _build_table, let's build it there.
Test Plan: Check that we have the same table as before, but also check that the parameters in question have the expected values.
Differential Revision: D62210351
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135155
Approved by: https://github.com/shanw-meta, https://github.com/kit1980
The idea behind the tracking is the following: whenever we see a tensor, if the tensor is a root tensor (does not have any view metas), we consider it as the base of all the tensors that share its storage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135141
Approved by: https://github.com/zou3519
We found a corner case: when a tensor dimension is 1, calling `view(1)` would result in an unexpected replication (see case 1 below). When the tensor dimension to shard is not 1, no matter whether the tensor dimension is evenly shardable across the mesh dimension, it won't cause an implicit replication behind the scenes if view doesn't change the size of the given tensor dimension (see cases 2 and 3 below).
When the tensor dimension to shard is of size 1, it is not being added to shardable_dims here:
https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/ops/_view_ops.py#L518
```
# uneven case where the size of the tensor dimension to shard is 1
p = torch.randn(1, 2)
mesh = init_device_mesh("cuda", (2,))
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(1, 2)
# this results in replication, meaning t is now replicated across all ranks.

# uneven case where the size of the tensor dimension to shard is not 1
p = torch.randn(3, 2)
mesh = init_device_mesh("cuda", (2,))
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(3, 2)
# this does not result in replication, meaning t stays sharded.

# even case
p = torch.randn(2, 2)
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(2, 2)
# this does not result in replication, meaning t stays sharded.
```
Differential Revision: [D62155606](https://our.internmc.facebook.com/intern/diff/D62155606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135054
Approved by: https://github.com/tianyu-l, https://github.com/wanchaol
Summary:
When exporting for training with `tolist`, we do not hit `FunctionalTensor.tolist` since we do not functionalize. Unfortunately, this means we hit `FakeTensor.tolist`, which creates unbacked symints that are not backed by proxies.
Rather than trying to patch up this low-level implementation, we replace it with essentially what `FunctionalTensor.tolist` does, which is higher-level: we essentially desugar to `item()` calls and let it take care of unbacked symints.
Test Plan:
Some expected failures are gone now.
Also found a test for `tolist` that was written when `FunctionalTensor.tolist` was implemented but not really doing much; repurposed it now to exercise more modes.
Differential Revision: D62197742
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135131
Approved by: https://github.com/ezyang
See https://github.com/pytorch/pytorch/pull/135138 for a usage example. Meta only, see https://docs.google.com/document/d/1JpbAQvRhTmuxjnKKjT7qq57dsnV84nxSLpWJo1abJuE/edit#heading=h.9wi46k7np6xw for context
fbscribelogger is a library that allows us to write to scribe, which is Meta's logging infrastructure, when you have appropriate access token (this token is available for jobs running on main, as well as authorized jobs with the ci-scribe label). The resulting data is accessible via Scuba (a real time in-memory database) and Hive (a more traditional SQL persisted database).
Here's the motivating use case. Suppose there is somewhere in PyTorch's codebase where you'd like to log an event, and then you'd like to find all the situations where this log is called. If PyTorch is rolled out to our internal users, we have some FB-oriented APIs (like torch._utils_internal.signpost_event) with which you can do this. But you have to actually land your PR to main, wait for it to be ingested to fbcode, and then wait for us to actually roll out this version, before you get any data. But what if you want the results within the next few hours? Instead, you can use torch._logging.scribe to directly write to our logging infrastructure *from inside CI jobs.* The most convenient approach is to log unstructured JSON blobs to `open_source_signpost` (added in this PR; you can also add your own dedicated table as described in the GDoc above). After adding logging code to your code, you can push your PR to CI, add 'ci-scribe' label, and in a few hours view the results in Scuba, e.g., (Meta-only) https://fburl.com/scuba/torch_open_source_signpost/z2mq8o4l If you want continuous logging on all commits on master, you can land your PR and it will be continuously get logging for all CI runs that happen on main.
Eventually, if your dataset is important enough, you can consider collaborating with PyTorch Dev Infra to get the data collected in our public AWS cloud so that OSS users can view it without access to Meta's internal users. But this facility is really good for prototyping / one-off experiments. It's entirely self serve: just add your logging, run your PR CI with ci-scribe, get results, do analysis in Scuba.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135224
Approved by: https://github.com/Skylion007
This enables inductor micro benchmark on CPU (x86):
* Running on AWS metal runner for more accurate benchmark
* I add a new `arch` column, which will be either x86_64 or arm64 for CPU, or the GPU name for GPU. We can use this later to differentiate between different setups, i.e. cuda (a100) vs cuda (a10g), or cpu (x86_64) vs cpu (arm64)
The next step would be to run this on CPU arm64 and cuda (a10g).
### Testing
Here is the CSV results from my test run https://github.com/pytorch/pytorch/actions/runs/10709344180
```
name,metric,target,actual,dtype,device,arch,is_model
mlp_layer_norm_gelu,flops_utilization,0.8,17.36,bfloat16,cpu,x86_64,False
gather_gemv,memory_bandwidth(GB/s),990,170.80,int8,cpu,x86_64,False
gather_gemv,memory_bandwidth(GB/s),1060,204.78,bfloat16,cpu,x86_64,False
Mixtral-8x7B-v0.1,token_per_sec,175,26.68,int8,cpu,x86_64,True
Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,171.91,int8,cpu,x86_64,True
Mixtral-8x7B-v0.1,compilation_time(s),162,47.36,int8,cpu,x86_64,True
gemv,memory_bandwidth(GB/s),870,236.36,int8,cpu,x86_64,False
gemv,memory_bandwidth(GB/s),990,305.71,bfloat16,cpu,x86_64,False
Llama-2-7b-chat-hf,token_per_sec,94,14.01,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),1253,185.18,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,compilation_time(s),162,74.99,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,token_per_sec,144,25.09,int8,cpu,x86_64,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,165.83,int8,cpu,x86_64,True
Llama-2-7b-chat-hf,compilation_time(s),172,70.69,int8,cpu,x86_64,True
layer_norm,memory_bandwidth(GB/s),950,172.03,bfloat16,cpu,x86_64,False
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135042
Approved by: https://github.com/yanboliang
If an `auto_functionalized` HOP is included in backward graph due to activation checkpointing, we will run into a scenario where Compiled Autograd Dynamo tracing will need to trace through the `auto_functionalized` HOP. This PR adds support for it.
Test commands:
- `pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_auto_functionalized`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135169
Approved by: https://github.com/zou3519
Summary:
Fixed some quantization tests for new training ir:
Fix batch norm node pattern matcher. In training ir, we have `aten.batch_norm` node instead of `aten._native_batch_norm_legit` and `aten._native_batch_norm_legit_no_training`.
Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test:quantization_pt2e
```
Reviewed By: tugsbayasgalan
Differential Revision: D62209819
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135184
Approved by: https://github.com/tugsbayasgalan
Summary:
Added the context manager `_disable_interpreter`, which is meant to be put around a call to `unflatten`. This will generate an UnflattenedModule and sub-InterpreterModules which will not use torch.fx.Interpreter to run eagerly. We want to have this as a state of the module instead of a context manager around running the module because it's not clear where we are calling the unflattened module.
This seems to improve the performance: https://fb.workplace.com/groups/1075192433118967/posts/1473590629945810/?comment_id=1473621763276030
Test Plan: CI
Differential Revision: D60939034
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133996
Approved by: https://github.com/pianpwk
We should not try to do ConstProp on unrecognized types (e.g. subclasses).
For those types, returning NotImplemented will jump to the next torch_dispatch.
Test:
```
python test/functorch/test_aotdispatch.py -k test_aot_test_subclasses_with_tensor_factories
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135033
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
## Semantic
The semantic is
(1) By default `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint).
```python
import torch
import torch.nn as nn
sd = nn.Linear(3, 5).state_dict()
with torch.serialization.skip_data():
    torch.save(sd, 'foo.pt')
print(torch.load('foo.pt', weights_only=True))
```
(2) With `torch.serialization.skip_data(materialize_fake_tensors=True)`, if a FakeTensor is passed to `torch.save`, the pickler will treat these FakeTensors as being "materialized": space will be reserved in the checkpoint for the associated storage bytes, and when loading, the type will be Tensor instead of FakeTensor.
```python
import torch
import torch.nn as nn
from torch._subclasses.fake_tensor import FakeTensorMode
with FakeTensorMode():
    m = nn.Linear(3, 5, dtype=torch.float16, device='cuda')
    sd = m.state_dict()
with torch.serialization.skip_data(materialize_fake_tensors=True):
    torch.save(sd, 'bla.pt')
print(torch.load('bla.pt', weights_only=True))
# OrderedDict([('weight', tensor([[0., 0., 0.],
# [0., 0., 0.],
# [0., 0., 0.],
# [0., 0., 0.],
# [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))])
```
## Follow Ups
- [ ] `torch.load` semantic for skip_data context manager
- [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass)
Differential Revision: [D62238610](https://our.internmc.facebook.com/intern/diff/D62238610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504
Approved by: https://github.com/albanD
For example, if I do TORCH_LOGS=fbscribelogger I'll get:
```
I0904 17:59:07.567000 3672513 fbscribelogger/__init__.py:161] stop
```
instead of
```
I0904 12:46:15.332000 2930287 ../../../../../home/ezyang/local/a/pytorch-env/lib/python3.10/site-packages/fbscribelogger/__init__.py:161] stop
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135165
Approved by: https://github.com/Skylion007
Solve the request [here](https://github.com/pytorch/pytorch/issues/120003#issuecomment-2248805798).
Enable DTensor input in the gradient scaler's APIs, especially in `.unscale_()`.
A related dispatch strategy is added to accept DTensor input.
To let found_inf perform a reduction across devices, we add an allreduce at dispatch, applied to the args after the dispatch strategy and kernel run.
Since `aten._amp_foreach_non_finite_check_and_unscale_.default` is an in-place op, grad_scale as arg[0] will be updated in place, so redesigning a strategy or refactoring the kernel would not help.
The test files cover the following parts under 1-d (dp) and 2-d (dp, tp) cases:
1. whether the non-inf values are unscaled
2. whether DTensors on each device can find inf even when it is not on their device
3. whether new parameters are generated when inf is not found
4. whether the scale is updated when inf is found
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132816
Approved by: https://github.com/XilunWu, https://github.com/weifengpy, https://github.com/wanchaol
Differential Revision: D61506212
Use `skipCUDAIf` from `torch.testing._internal.common_device_type` if we create the test class with `instantiate_device_type_tests`.
`instantiate_device_type_tests` would make sure the class has the attr device_type, which works with `skipCUDAIf` from `torch.testing._internal.common_device_type`.
Also skipping test_vertical_pointwise_reduction_fusion for cpu test class, since the test expects cuda.
FAILED [0.0026s] test/inductor/test_unbacked_symints.py::TestUnbackedSymintsCPU::test_vertical_pointwise_reduction_fusion_cpu - AttributeError: 'TestUnbackedSymintsCPU' object has no attribute 'device'
repro:
```
CUDA_VISIBLE_DEVICES="" pytest test/inductor/test_unbacked_symints.py -k cpu -v
```
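As a hedged illustration of the pattern described above (not code from this PR), the class below is instantiated per device type, which gives it the `device_type` attribute that `skipCUDAIf` relies on:
```python
import torch
from torch.testing._internal.common_device_type import (
    instantiate_device_type_tests,
    skipCUDAIf,
)
from torch.testing._internal.common_utils import TestCase, run_tests


class ExampleTemplate(TestCase):
    # skipCUDAIf only skips the CUDA-instantiated copy of this test
    @skipCUDAIf(True, "illustrative skip condition")
    def test_add(self, device):
        x = torch.ones(2, device=device)
        self.assertEqual(x + x, torch.full((2,), 2.0, device=device))


# Creates ExampleTemplateCPU / ExampleTemplateCUDA classes with `device_type` set
instantiate_device_type_tests(ExampleTemplate, globals())

if __name__ == "__main__":
    run_tests()
```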
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133936
Approved by: https://github.com/ColinPeppler, https://github.com/desertfire
## Summary
At the moment, the fake impl for `masked_select` simply sets the upper bound of its size-like SymInt to `sys.maxsize` (9223372036854775807, the max value for a signed int64) if there are any SymInts in the original input tensor shape. This PR constrains the range more intelligently by using the upper ranges of each SymInt in the input tensor shape.
This solves an issue where a model being lowered to Executorch errors during memory planning because the memory allocated for `masked_select` ended up exceeding the 64-bit address space (`INT_MAX * size(dtype)`).
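As a rough sketch of the bound computation described above (illustrative only, not the actual fake-impl code): the output of `masked_select` can never have more elements than its input, so the upper bound of the unbacked output size can be the product of the input dimensions' upper bounds instead of `sys.maxsize`.
```python
import math
import sys

def masked_select_size_upper_bound(dim_upper_bounds):
    # e.g. a (s0 <= 64, 128, 512) input gives 64 * 128 * 512,
    # far below the old sys.maxsize bound
    return math.prod(dim_upper_bounds)

assert masked_select_size_upper_bound([64, 128, 512]) < sys.maxsize
```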
## Test plan
- Passes existing unit tests (tests case where upper bound is inf)
- Added unit test to verify upper bound reduction calculation
- Tested end-to-end by exporting with TORCH_LOGS="export" and ensuring that the range for `masked_select`'s SymInt size has the correct upper bound
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134899
Approved by: https://github.com/ezyang
In C++, when a floating-point literal (e.g., 3.14) is compared with a variable of type float, the literal is by default interpreted as a double.
```c++
float f = 3.14f;
if (f == 3.14) {
// Do something
}
```
If a device does not support double, an error will occur.
This PR addresses the issue of complex64 errors on machines that do not support double operations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134647
Approved by: https://github.com/EikanWang, https://github.com/albanD
We found that currently, we only pass one input and output tensor to the function `collective`, and this causes the NaN check, work numel stats, and FR input/output sizes to be inaccurate for all-to-all, scatter, and reduce. So we want to let the collective take in a list of tensors to ensure it works for all collectives inside PGNCCL.
This partially revert what we did in https://github.com/pytorch/pytorch/pull/119421, and down the road we will have another round of cleanup on the collective to make it cleaner. For now, at least for the sake of correctness, we changed it back.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135049
Approved by: https://github.com/kwen2501
Adds val, and optionally stack_trace & nn_module_stack metadata back to SymInt compute nodes that we CSE, with a hook on `graph.create_node()`. Not sure if there's other metadata we want to populate here?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134516
Approved by: https://github.com/ezyang
Fixes the FP32 accuracy failure of `levit_128` in timm.
Previously, we used `Y` which is the output of the final epilogue node to calculate the reindexer. We actually need to use each epilogue node to calculate the reindexer from the GEMM output to the epilogue node.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134984
Approved by: https://github.com/jgong5
Adds utility functions `_dump_dynamic_shapes` and `_load_dynamic_shapes`.
- `_dump_dynamic_shapes`: dynamic shapes spec -> serialized format:
- takes in the `dynamic_shapes` pytree object you'd feed into `export()`, and dumps into serialized format
- `_load_dynamic_shapes`: serialized format -> dynamic shapes spec
- takes the serialized format, and produces a `dynamic_shapes` object you feed into `export()`
For example with dumping:
```
dx = Dim("dx", min=4, max=16)
dy = dx + 1
inputs = (
    [
        torch.randn(4, 4),
        torch.randn(5, 4),
    ],
    torch.randn(4),
    torch.randn(4, 4),
    "hello",
)
dynamic_shapes = {
    "a": [
        (dx, 4),
        (dy, 4),
    ],
    "b": (Dim.AUTO,),
    "c": None,
    "d": None,
}
out = _dump_dynamic_shapes(dynamic_shapes, inputs)
```
would generate the following output:
```
DynamicShapesSpec(
    dynamic_shapes=(
        [
            ['dx', 4],
            ['dx + 1', 4],
        ],
        ['_DimHint.STATIC'],
        ['_DimHint.STATIC', '_DimHint.STATIC'],
        None,
    ),
    dims={
        'dx': RootDim(
            min=4,
            max=16,
            derived=['dx + 1'],
        ),
    },
)
```
The serialized format contains 2 keys, `dynamic_shapes` and `dims`.
- `dynamic_shapes` is the pytree structure matching the input to `export()`, with strings in place of Dim names and enums, and ints/Nones otherwise. Each tensor is represented with a list of shapes, non-tensors with Nones.
- `dims` contain min/max range and derived dims info for each root dim.
The test cases show some roundtrippability guarantees for these functions. Definitely taking naming suggestions for them :)
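Continuing the dumping example above, a minimal round-trip sketch looks like the following; the import location of these private helpers is an assumption here, not something guaranteed by this PR:
```python
# Assumed import path; the helpers are private and may live elsewhere.
from torch.export.dynamic_shapes import _dump_dynamic_shapes, _load_dynamic_shapes

spec = _dump_dynamic_shapes(dynamic_shapes, inputs)  # -> DynamicShapesSpec as shown above
roundtripped = _load_dynamic_shapes(spec)            # -> a dynamic_shapes pytree for export()
```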
Follow up: utility function to extract serializable format from ExportedProgram.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134718
Approved by: https://github.com/avikchaudhuri
Before this PR, when traceable FSDP2 + AC is run, an error would be thrown:
```
File "/data/users/willfeng/pytorch/torch/_dynamo/variables/builtin.py", line 1449, in call_getitem
return args[0].call_method(tx, "__getitem__", args[1:], kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 435, in call_method
return super().call_method(tx, name, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 392, in call_method
return super().call_method(tx, name, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 131, in call_method
return self.getitem_const(tx, value)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 106, in getitem_const
return self.items[index]
Error: Index out of bound
from user code:
File "<eval_with_key>.5", line 105, in forward
aot0_trace_wrapped = torch__dynamo__trace_wrapped_higher_order_op_self_invoke(aot0_tangents_1, bw_state = aot0_primals_34); aot0_tangents_1 = None
File "/data/users/willfeng/pytorch/torch/_dynamo/_trace_wrapped_higher_order_op.py", line 74, in self_invoke
return _trace_wrapped_op(*args, **dyn_kwargs, **kwargs)
File "/data/users/willfeng/pytorch/torch/_dynamo/external_utils.py", line 132, in call_hook_from_backward_state
return getattr(bw_state, hook_name)(*args, **kwargs)
File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_state.py", line 271, in _pre_backward
self._fsdp_param_group.pre_backward(default_prefetch)
File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 332, in pre_backward
self._backward_prefetch()
File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 417, in _backward_prefetch
target_fsdp_param_group = self.comm_ctx.post_forward_order[target_index]
```
Since it's okay to rely on the compiler to recover the "prefetching" pattern, we will skip this `_backward_prefetch()` code path during tracing to avoid the error, and have a compiler pass (in future PR) to achieve the equivalent prefetching overlap.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135163
Approved by: https://github.com/awgu
This PR is a supplement to #130082. The previous PR #130082 fulfilled the basic functionality of codegen, but we found it fails to handle the device sameness check in many unit tests. The current PR aims to facilitate XPU device guard code generation.
With current PR, the code snippet in `RegisterXPU.cpp` is as follows, where we can see the device guard is successfully generated.
```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
  std::optional<Device> common_device = std::nullopt;
  (void)common_device; // Suppress unused variable warning
  c10::impl::check_and_update_common_device(common_device, out, "wrapper_XPU_Tensor_float_out_normal_out", "out");
  c10::impl::check_and_update_common_device(common_device, mean, "wrapper_XPU_Tensor_float_out_normal_out", "mean");
  const OptionalDeviceGuard device_guard(device_of(out));
  return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```
Nevertheless, without current change, the generated code is
```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
  // No device check
  // DeviceGuard omitted
  return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133980
Approved by: https://github.com/EikanWang, https://github.com/malfet
resolve: https://github.com/pytorch/pytorch/pull/135029
When enabling mixed precision, FSDP casts input args to the desired dtype by calling `_apply_to_tensors`. When the input args include a `dataclass(frozen=True)`, we hit the following runtime error because `_apply_to_tensors` uses `setattr`:
`dataclasses.FrozenInstanceError: cannot assign to field 'some_key'`. The fix is to use the dataclasses API `dataclasses.replace`.
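A minimal sketch of the difference, using a stand-in frozen dataclass rather than FSDP itself (field names here are illustrative):
```python
import dataclasses
import torch


@dataclasses.dataclass(frozen=True)
class Inputs:
    some_key: torch.Tensor


inp = Inputs(some_key=torch.randn(2, 4))

try:
    # what a setattr-based cast effectively does on a frozen dataclass
    setattr(inp, "some_key", inp.some_key.half())
except dataclasses.FrozenInstanceError as e:
    print(e)  # cannot assign to field 'some_key'

# the fix: build a new instance with the cast field
inp_half = dataclasses.replace(inp, some_key=inp.some_key.half())
print(inp_half.some_key.dtype)  # torch.float16
```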
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135067
Approved by: https://github.com/awgu
Fixes #132715
The failure in #132715 is due to `autocast_dtype` being a thread-local variable. It causes `get_autocast_dtype()` to return inconsistent results across different threads.
To be exact, what is happening is the following: the amp dtype is set to `bfloat16` on the main thread. The `backward` call runs on a side thread, so `at::autocast::prioritize` fails because `lower_precision_fp` defaults to `float16`:
6f738d6434/aten/src/ATen/autocast_mode.h (L221-L225)
This PR makes `autocast_dtype` thread-global so that it is consistent among all threads of the forward and backward passes.
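A minimal sketch of the thread-locality problem, using CPU autocast and the `torch.get_autocast_dtype(device_type)` accessor (assumed available in recent builds) so it runs without a GPU; the CUDA failure above follows the same mechanism:
```python
import threading

import torch

seen = []
with torch.autocast("cpu", dtype=torch.float16):
    main_dtype = torch.get_autocast_dtype("cpu")  # float16, set by the context manager
    t = threading.Thread(target=lambda: seen.append(torch.get_autocast_dtype("cpu")))
    t.start()
    t.join()

# With a thread-local dtype, the worker thread reports the per-thread default
# instead of main_dtype; with the thread-global dtype from this PR they agree.
print(main_dtype, seen[0])
```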
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133938
Approved by: https://github.com/soulitzer
# Description
This pipeline enables the CI build on Windows for PRs labeled with ciflow/xpu. It builds the torch binary with Torch XPU Operators on Windows using Visual Studio Build Tools 2022.
# Changes
1. Install xpu batch file (install_xpu.bat) - Check if build machine has oneAPI in environment, and if the version of it is latest. If not, install the latest public released oneAPI in the machine.
2. GHA callable pipeline (_win-build.yml) - Set vc_year and use_xpu as parameter to set build wheel environment.
3. GHA workflow (xpu.yml) - Add a new windows build job and pass parameters to it.
4. Build wheels script (.ci/pytorch/win-test-helpers/build_pytorch.bat) - Prepare environment for building, e.g. install oneAPI bundle.
# Note
1. For building wheels on Intel GPU, you need Visual Studio Build Tools version >= 2022
2. This pipeline requires Visual Studio Build Tools 2022 to build wheels. For now, we specify "windows.4xlarge.nonephemeral" as the build machine label in the yaml file. We will request self-hosted runners with an Intel GPU and Visual Studio Build Tools 2022 installed soon.
Work for #114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133151
Approved by: https://github.com/chuanqi129, https://github.com/atalman
Co-authored-by: chuanqiw <chuanqi.wang@intel.com>
Moved all the backward functions (`stage_backward_input`, `stage_backward_weight`, `stage_backward`) under the same `backward_maybe_with_nosync` function which controls the logic of the data parallel wrappers.
FSDP was not working with zero bubble PP because there will be twice as many "backward" calls and we update the weight gradients after `autograd.grad` is called. As a result, we need to manually call the FSDP `post_backward_hook()` after the weights have the correct gradients.
Fixes the tests:
`python test/distributed/_composable/test_composability/test_pp_composability.py ComposabilityTest.test_manual_with_data_parallel_dp_type_FSDP_ScheduleClass0_use_new_runtime_False`
`python test/distributed/_composable/test_composability/test_pp_composability.py ComposabilityTest.test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_runtime_False`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134052
Approved by: https://github.com/kwen2501
Fixes https://github.com/pytorch/pytorch/issues/133858
Details: Previously Dynamo would treat dataclasses as UserDefinedVariables. This was undesirable when we would like to proxy the value into the graph, which is needed for TensorSubclassMetadata. To rectify this, frozen dataclasses can now be proxied similarly to NamedTuples. We require the object to be frozen, because if arbitrary mutation were allowed, we would need to replay those mutations in the graph after construction of the object.
For tracing construction of the variable, the generated `__init__` for the dataclass uses `object.__setattr__` because frozen dataclasses throw errors on the usual `__setattr__` invocation. With this treatment, no special handling is needed in dynamo for frozen dataclass construction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134846
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
Summary:
D62008788 added an extra parameter to the RawTensorMetadata struct. For some reason this causes some corrupted accesses in other tests as described in T200685032.
Once this is removed the tests pass. Going forward we need to document how to add parameters to this portion of the code as the AppendOnlyLists seem to be very rigid.
Test Plan: Ran all the tests locally and they all passed.
Differential Revision: D62171089
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135096
Approved by: https://github.com/aaronenyeshi
It's a bit surprising that the code added in Scheduler.fusable_read_and_write would increase compilation time.
Here are some numbers I get from an H100 on BertForMaskedLM:
- without the fix, cold start compilation time is around 82s
- with the fix, cold start compilation time is around 76s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135071
Approved by: https://github.com/jansel
Fixes https://github.com/pytorch/pytorch/issues/134798
In the regular Tensor case, when you call Tensor.data, there's a check
for if inference mode is active. If it is active, then we don't set the
version counter. We replicate this check for Tensor Subclasses (the bug
was we were trying to set the version counter on a FakeTensor in
inference_mode).
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134878
Approved by: https://github.com/bdhirsh
- The new implementation (auto_functionalized_v2) is enabled by default but can be disabled using an inductor flag.
- In export mode the old implementation is used.
**Motivation**
Previous functionalization fails to re-inplace arguments when they are view over other tensors.
see issue https://github.com/pytorch/pytorch/issues/131192
The new functionalization is easier to re-inplace for views.
**A) Functionalizations pass**
consider a program:
```
func(t)
    x = t[0]
    y = t[1]
    foo(x, y)  # custom operator with x, y mutable
    return (x, y, t)
```
- To functionalize `foo` we generate a function that operates on the base tensors of the inputs (x.base() and y.base()), and record how to regenerate the views out of the base for argument x by recording ```ViewInfo=(x.base(), x.size(), x.stride(), x.storage_offset())```
- Due to some limitations of the torch.export arguments format, we have to generate a lot of arguments, but this is something we can simplify in the future. For the example above we get the following function.
```
auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.mylib.foo.default,
_x_base_index = 0, _x_size = (), _x_stride = (), _x_storage_offset = 0 ,
_y_base_index = 0,_y_size = (), _y_stride = (), _y_storage_offset = 1 ,
_all_bases = [arg0_1])
```
- In the code above:
- _all_bases[t]: refers to a unique set of bases for all foo arguments.
- for each argument x we have _x_base_index, _x_size, _x_stride, _x_storage_offset that can be used to regenerate x from _all_bases[_x_base_index] or from a copy of that base.
- the output of auto_functionalized is the foo output, followed by one tensor per base in _all_bases, each being a copy of the base tensor after observing the mutations of all the arguments that are views of that base.
- for each use of a base in _all_bases (or a view of it) after the call to foo, replace it with a view of the new output.
For the function above, after functionalization we get:
```
def forward(self, arg0_1: "f32[2][1]cpu"):
    auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.mylib.foo.default, _x_base_index = 0, _x_size = (), _x_stride = (), _x_storage_offset = 0, _y_base_index = 0, _y_size = (), _y_stride = (), _y_storage_offset = 1, _all_bases = [arg0_1])
    getitem_1: "f32[2][1]cpu" = auto_functionalized[1]; auto_functionalized = None
    copy_: "f32[2][1]cpu" = torch.ops.aten.copy_.default(arg0_1, getitem_1); arg0_1 = copy_ = None
    # No stacktrace found for following nodes
    select_2: "f32[][]cpu" = torch.ops.aten.select.int(getitem_1, 0, 0)
    select_3: "f32[][]cpu" = torch.ops.aten.select.int(getitem_1, 0, 1); getitem_1 = None
    return (select_2, select_3)
```
**B) Semantics of auto_functionalize**
The new semantics of auto_functionalized is as follows:
1. For each base in all_bases, copy the base and create all_bases copies. (if a base is inplaced we do not need to copy it)
2. For each arg, regenerate the arg from the copy of its base using the view information above.
3. return the original foo output followed by the new bases.
**C) Re-inplace pass**
Since auto_functionalized now copies the bases rather than the args, what we actually re-inplace are the bases (this runs just like before, but on the bases instead of the args).
1. For each base b in _all_bases, check if there is any use of the base (or its aliases/views) after auto_functionalized (before it is overwritten with a copy); if there is not any, then inplace it (i.e. avoid copying it in step 1 above).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134409
Approved by: https://github.com/zou3519
Summary: When we process keyword arguments in profiler today we assume that all values will be strings. This breaks HTA because it assumes that "stream" and other values similar to it will be ints. To fix this we will only put quotes around strings for ivalues.
Test Plan: Add chrome trace export in unit tests and check that stream does not have quotes around it
Differential Revision: D62056059
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134893
Approved by: https://github.com/sanrise, https://github.com/izaitsevfb
This is a bit twisty and I don't entirely understand the situation, but here's my best explanation.
In https://github.com/pytorch/pytorch/pull/133588 I am trying to fix a problem reported by user in https://fb.workplace.com/groups/6829516587176185/permalink/7705964779531357/ The summary of this problem is that when we do collect metadata analysis in AOTAutograd, we accumulate pending unbacked symbols which are going to be discarded at the end of the trace. However, if we do a recursive make_fx inside tracing, as occurs with torch.cond, we end up seeing that there are pending unbacked symbols that aren't associated with a binding, even though it's spurious (they've leaked into the inner make_fx call from the outer AOTAutograd analysis).
In #133588 I tried to just prevent adding the symbols to the pending list at all in the first place. But this itself caused some problems which were fixed in https://github.com/pytorch/pytorch/pull/124785 . The problem fixed in that PR is that when we allocate tangents that have unbacked size, something prevented them from having correct unbacked SymInts when ignore fresh unbacked SymInts was enabled. So I had patched it at the time by just not suppressing pending symbols and clearing them out some other way.
I think... I was wrong in that PR? That is to say, it was OK to avoid putting the fresh unbacked symbols in the pending list; the real problem was suppressing unbacked renamings. But there doesn't seem to be a good reason to suppress these; this PR shows that it doesn't actually fail any tests if you do these anyway. Intuitively, this makes sense, because you can't trigger renamings unless you're actually adding unbacked symbols to the pending set.
But I don't entirely understand all the interactions. I just know that this seems to not cause tests to fail, and it should fix the internal issue (which I need to add a UT for.)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134407
Approved by: https://github.com/ydwu4
I propose a new heuristic function to select tile size, cluster size, and transposition given M, N and K. It improves the performance across the board (on average) while remaining simple and relying only on a handful of kernels (to limit build time and binary size).
Across the shapes I benchmarked, the new heuristic gives a (geometric) mean speedup of +16.5%. Some shapes worsen, but 98.6% of the shapes retain their old performance (up to 5% to allow for noise) or improve it.

I benchmarked on over 5.4k different shapes:
- For M and N I swept across all values which are the sums of two powers of 2 (limited to multiples of 64, capped at 16,384)
- For K I only used powers of 2 between 1,024 and 8,192 (based on the intuition that the optimal config doesn't depend on K, which turned out to be the case)
Here's the detailed speedup for each shape

<details>
<summary>
This is the code I used to benchmark
</summary>
```
import torch
import torch.utils.benchmark

s = set()
for i in range(6, 15):
    s.add(2**i)
    for j in range(6, i):
        s.add(2**i + 2**j)
ms = [i for i in sorted(s) if i <= 2**14]
ns = [i for i in sorted(s) if i <= 2**14]
ks = [2**i for i in range(10, 14)]

def make_graph(n_iters, f):
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        for _ in range(n_iters):
            f()
    return g

def rowwise_scale(t, dtype_t):
    min_v, max_v = torch.finfo(dtype_t).min, torch.finfo(dtype_t).max
    scale_t = torch.clamp(t.abs().amax(dim=-1, keepdim=True).float(), min=1e-12) / max_v
    t_fp8 = (t / scale_t).clamp(min=min_v, max=max_v).to(dtype_t)
    return t_fp8, scale_t

for m in ms:
    for n in ns:
        for k in ks:
            a = torch.randn((m, k), device="cuda", dtype=torch.float)
            b_t = torch.randn((n, k), device="cuda", dtype=torch.float)
            a_fp8, scale_a = rowwise_scale(a, torch.float8_e4m3fn)
            b_t_fp8, scale_b_t = rowwise_scale(b_t, torch.float8_e4m3fn)
            func = lambda: torch._scaled_mm(
                a_fp8,
                b_t_fp8.t(),
                scale_a=scale_a,
                scale_b=scale_b_t.t(),
                bias=None,
                use_fast_accum=True,
                out_dtype=torch.bfloat16
            )
            print(f"{m=},{n=},{k=}")
            print(torch.utils.benchmark.Timer("g.replay()", globals={"g": make_graph(1000, func)}).blocked_autorange(min_run_time=1).mean / 1000)
```
</details>
<details>
<summary>
This is the code I used for the plots
</summary>
```
from itertools import islice
import numpy as np  # needed for the np.* calls below
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib.colors import FuncNorm
from mpl_toolkits.axes_grid1 import ImageGrid

def batched(iterable, n):
    iterator = iter(iterable)
    while batch := tuple(islice(iterator, n)):
        yield batch

def try_to_convert(v):
    if v == "False":
        return False
    if v == "True":
        return True
    return int(v)

def get_from_paste(filename):
    text = open(filename, "rt").read()
    headers = []
    data = []
    for config, value in batched(text.splitlines(), 2):
        config_elems = config.split(",")
        if not headers:
            headers = [e.partition("=")[0] for e in config_elems]
        data.append((*(try_to_convert(e.partition("=")[-1]) for e in config_elems), float(value)))
    return pd.DataFrame(data, columns=headers + ["latency"])

old_latencies = get_from_paste(...)
new_latencies = get_from_paste(...)
ratios = pd.merge(new_latencies, old_latencies, how="left", left_on=["m", "n", "k"], right_on=["m", "n", "k"], suffixes=("_new", "_old"))
ratios = ratios.assign(ratio=ratios.latency_old / ratios.latency_new)

fig = plt.figure(figsize=(40.0, 10.0))
grid = ImageGrid(
    fig,
    111,
    nrows_ncols=(1, 4),
    axes_pad=0.5,
    share_all=True,
    cbar_location="right",
    cbar_mode="single",
    cbar_size="7%",
    cbar_pad=0.15,
)
log_amax = np.max(np.abs(np.log(ratios.ratio.to_numpy())))
for K, ax in zip([1024, 2048, 4096, 8192], grid):
    pivoted = ratios[(ratios.k == K)].pivot_table(index="m", columns="n", values="ratio")
    im = ax.imshow(np.log(pivoted.to_numpy()), origin="lower", vmin=-log_amax, vmax=log_amax, cmap="PiYG")
    m_vals, n_vals = pivoted.axes
    ax.set_xticks(np.arange(len(n_vals)), labels=[f"N={i}" for i in n_vals.values], fontsize=12)
    ax.set_yticks(np.arange(len(m_vals)), labels=[f"M={i}" for i in m_vals.values], fontsize=12)
    plt.setp(ax.get_xticklabels(), rotation=90, ha="right", rotation_mode="anchor")
    ax.grid(False)
    ax.set_title(f"K={K}", fontsize=20)
norm = FuncNorm((lambda x: np.log(x), lambda x: np.exp(x)), np.exp(-log_amax), np.exp(log_amax))
ax.cax.colorbar(ScalarMappable(norm=norm, cmap="PiYG"))
plt.show()

counts, bins = np.histogram(np.log(ratios.ratio.to_numpy()), bins=500)
plt.stairs(counts, np.exp(bins), fill=True)
plt.xscale("function", functions=(lambda x: np.log(x), lambda x: np.exp(x)))
</details>
I only benchmarked fast_accum=True and out_dtype=torch.bfloat16 supposing that these are the most commonly-used flags (e.g., with fast_accum=False row-wise scaling is much slower than tensor-wise scaling hence unpractical).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134781
Approved by: https://github.com/drisspg, https://github.com/eqy
ghstack dependencies: #134773
On some occasion, a column-major output layout is more efficient (it's unclear if it's because of better store coalescing for some tile shapes, or whether it's just that it's CUTLASS's default and thus it's better optimized).
At this stage I only add a flag that allows to transpose, but the hardest will be deciding on a new heuristic to turn it on selectively. This will be in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134773
Approved by: https://github.com/drisspg
Fixes an issue after updating XNNPACK where parsing the XNNPACK CMakeLists breaks. I've just ignored the generated build identifier for now, since it's not used and we would need to update the buck build to generate it at build time.
Remove unused ukernels_xop XNNPACK target as it has no sources (after the recent update) and causes buck1 to complain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134724
Approved by: https://github.com/mcr229
Summary:
A bit of refactoring to prepare to remove `None` as a way to specify static dimensions in dynamic shapes, given we already have `Dim.STATIC` for the same purpose. We will now warn whenever this happens. However no tests were modified because problematic uses of `None` still need to behave as they do today, until we are ready to remove support. It should be easy to port tests by replacing the warning function to raise instead.
Note that other uses of `None`, such as for entire values (tensor or non-tensor) remain as is. Moving forward this should be the only purpose of `None` (at least externally).
Finally, there's a bit of confusion in our representation now because `AUTO` also internally transforms to `None`. Renamed dynamic_shapes to transformed_dynamic_shapes where this happens. Overall the two forms (pre and post transformation) have different properties so should probably not be represented in the same format in the future.
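A small sketch of the preferred spelling going forward, marking a dimension static with `Dim.STATIC` instead of `None` (module and shapes are illustrative):
```python
import torch
from torch.export import Dim, export


class M(torch.nn.Module):
    def forward(self, x):
        return x + 1


x = torch.randn(4, 8)
# dim 0 is dynamic, dim 1 is explicitly static (previously often spelled as None)
dynamic_shapes = {"x": {0: Dim("batch"), 1: Dim.STATIC}}
ep = export(M(), (x,), dynamic_shapes=dynamic_shapes)
```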
Test Plan: existing
Differential Revision: D62040729
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134877
Approved by: https://github.com/pianpwk
Summary:
Original commit changeset: 96513cbc425f
Original Phabricator Diff: D61291210
There is some evidence that FB-FM-v4 has better NE with ctx.set_materialize_grads(False) set, especially when paired with prefetching.
See https://www.internalfb.com/intern/anp/view/?id=5732259
Test Plan:
export NUM_WORKERS=128
export BATCH_SIZE=1024
export CONFIG_FILE="mast_joint_arch_exploration_cmf_updated_fbfm_v3_fsdp2.yaml"
export ENTITLEMENT=ads_global_tc_2k_training_large_short
buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -c fbcode.platform010_cuda_version=12 -c hpc_comms.use_nccl=2.17.1 -- mode=${CONFIG_FILE} launcher.tags='[ads_ranking_taxonomy_monetization_genai]' launcher.data_project=pytorch_at_scale launcher.max_retries=10 launcher.fbl_entitlement=${ENTITLEMENT} launcher.oncall=pytorch_training_enablement launcher.hardware=GRANDTETON launcher.num_workers=${NUM_WORKERS} data_loader.dataset.batch_size=${BATCH_SIZE} training.planner.proposer=dynamic_col_dim training.planner.proposer.optim_target=hbm 2>&1| tee ~/tmp/log.mast
Differential Revision: D62009163
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135059
Approved by: https://github.com/awgu
Summary: Torch-compiling a quick script can be a bit slower than it needs to be: even though we initialize the subprocess pool early, it still might not be ready by the time we try to compile the first Triton kernel. Instead, let's use the single-threaded path until the pool has successfully completed a no-op job.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133508
Approved by: https://github.com/Chillee
Summary:
1. Move the debug printer call a level lower -> at here
:https://www.internalfb.com/code/fbsource/[931d7bbb9e7cf2dcb926f42718f56fc940903eec]/fbcode/caffe2/torch/_inductor/codegen/cpp_wrapper_cuda.py?lines=335
2. Add UT for validating debug printer for user defined triton kernel codegen
The benefit of having the debug printer call happen at a more centralized place is that it 1) reduces the duplicated debug-printer-related logic scattered across the codebase and 2) can handle more triton kernel codegen paths as long as they invoke `generate_kernel_call()`; for example, it can automatically handle/support user_defined_kernel's debug printing, which is a pretty common use case we encounter in debugging.
Test Plan:
```AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_user_defined_triton_kernel_abi_compatible_cuda```
Also verified that templateKernel codegen path still works
Differential Revision: D61949020
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134789
Approved by: https://github.com/ColinPeppler
Summary: We noticed that there will be a runtime error when doing the dim broadcast if the meta example value has a symbolic shape, thus we skip it.
Test Plan:
```
buck2 run mode/opt //caffe2/benchmarks/dynamo/fb:torchbench_run_ads_dhen_5x_training -- -m ads_dhen_5x -t training
```
P1559019921
Differential Revision: D62115015
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134975
Approved by: https://github.com/xuzhao9
Summary:
Currently some jobs are encountering the following trace, P1539415198. This suggests that when we are parsing through tensors the path is prone to encountering an invalid address. This is possibly occurring because for some reason the sizes() and strides() of a Tensor seem to not be of the same dimensions. We assume as much when iterating through the shapes to get the Ivalue generator. When browsing some of the tensor implementations, I found that some of the size and stride paths are different, which could be the cause of this issue. Regardless, the profiler should be flexible enough to handle such issues without bringing down the whole main thread.
If the crashes still persist, it will still give us a data point as to where they are occurring and we can rule out the strides/sizes as the culprit
Test Plan: This change doesn't break anything in the happy path, just makes sure the bad path is not exited abruptly. We should use this in order to debug what the events are having mismatching dimensions between sizes and strides.
Differential Revision: D62008788
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134862
Approved by: https://github.com/aaronenyeshi
We keep two copies of the runner-determinator script:
1. In runner_determinator.py, for ease of testing. This however is not actually executed during CI
2. Embedded in _runner-determinator.yml. This is what CI uses.
Why the duplication? Short version: Because of how github CI works, during a given CI run the workflow yml files could actually come from the main branch, while the remaining files get read from the local commit.
This can lead to a newer version of _runner-determinator.yml trying to invoke an older version of runner_determintor.py than it was actually designed for. Chaos ensues.
We mitigate this by embedding the script into the yml file. But we still keep the script around because it's much easier to run tests against.
This workflow's job is to ensure that if one edits the script in one of those two locations then they remember to update it in the other location as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134800
Approved by: https://github.com/zxiiro, https://github.com/PaliC
ghstack dependencies: #134796
D53335860 and D56435815 added an option to torch elastic allowing users to choose a TCPStore backend type to use via
1) explicit argument passing in user code when instantiating `MastRendezvousHandler`
2) pass `--use_libuv` command line argument to `torchrun`.
The motivation was to offer a quick way to roll back to the non-libuv TCPStore backend since we were making libuv the default in `c10d` code. Now we think it's better for torch elastic to not be aware of the TCPStore backend type, and instead rely on `c10d`'s mechanism to decide which backend to use for torch elastic as well. In this sense, the TCPStore backend type used by torch elastic will be identical to that in pytorch.
PyTorch TCPStore uses the environment variable `USE_LIBUV` to determine the backend type:
when `USE_LIBUV="0"`, the non-libuv backend will be used.
when `USE_LIBUV="1"`, the libuv backend will be used. And this is the default option.
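A small sketch of how the backend choice is driven purely by this environment variable; the single-process `gloo` group here is just for illustration:
```python
import os

# Must be set before the store is created; "1" (the default) selects libuv.
os.environ["USE_LIBUV"] = "0"
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

import torch.distributed as dist

# The env:// rendezvous creates the TCPStore, honoring USE_LIBUV.
dist.init_process_group("gloo", rank=0, world_size=1)
dist.destroy_process_group()
```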
Differential Revision: [D58259590](https://our.internmc.facebook.com/intern/diff/D58259590/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134882
Approved by: https://github.com/shuqiangzhang
Summary:
The current use case is to continuously measure the total allocated and reserved CUDA memory size from CUDACachingAllocator, and export their distribution (min, max, p90 etc) over time as timeseries.
The current callback-based API does not work because the backend decides when the measurement is taken, so data points between two measurements may not be recorded. The distribution (e.g. max) as such will not be accurate.
This new API closely follow the design of the existing WaitCounter API otherwise.
This is not quite a synchronous version of DynamicCounter, as summing multiple data points does not make sense for my use case.
Test Plan: CI
Differential Revision: D61837528
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134883
Approved by: https://github.com/c-p-i-o
The issue:
Const propagation only checks that arguments are not FakeTensors. If an argument is a Subclass, it will pass this condition.
As a result, const propagation execution happens without FakeTensorMode, and having tensor factories inside Subclass.__torch_dispatch__ results in this Tensor not being fakified.
Solution:
If we have subclass arguments, do not consider const propagation doable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134855
Approved by: https://github.com/zou3519
op_level_debug helped to identify missing operators and wrongly implemented operators at the time that the dynamo exporter relied on nearest matching and torchlib was just created. However, now that the dispatcher logic has improved and torchlib has matured, we no longer need it.
PS: op-level-debug diagnostics rule is not deleted in this PR, as it auto generates lint error code, and need more time to fix. We can delete it when we retire sarif.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134961
Approved by: https://github.com/justinchuby
Based on https://github.com/pytorch/pytorch/pull/130956.
Inductor already supports padding through the `config.comprehensive_padding` option, but the padding format involves a few heuristics that are specific to Nvidia GPUs:
- When we pad, it is always aligned to the next multiple of 128 bytes.
- Strides smaller than 1024 are not padded.
- Only intermediate values are padded, not outputs.
The last of these is not really GPU-specific, but there are certain cases where we may want to override it. For example, padding outputs is useful on hardware accelerators with specific memory alignment requirements, or for applications where performance is more important than conformity with eager mode.
This PR surfaces padding parameters up to Inductor's config module, so the user can control them.
- `config.pad_outputs`: choose whether to pad outputs (default: `False`)
- `config.padding_alignment_bytes`: choose the alignment size for padding (default: `128`)
- `config.padding_stride_threshold`: choose the smallest stride that we will pad. For example, setting this to 0 will pad all unaligned strides. (default: `1024`)
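A hedged sketch of how the three options above could be set; the flag names come from this PR, everything else (the toy function and values) is illustrative:
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.comprehensive_padding = True
inductor_config.pad_outputs = True             # also pad graph outputs
inductor_config.padding_alignment_bytes = 64   # align to 64 bytes instead of 128
inductor_config.padding_stride_threshold = 0   # pad all unaligned strides

@torch.compile
def f(x):
    return (x @ x).relu()

out = f(torch.randn(127, 127))
```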
**Test plan**
Added a new test in `test_padding.py` which tries various combinations of these options, checking that the output strides match our expectations.
These changes should not affect perf, because the defaults are identical to Inductor's current behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133939
Approved by: https://github.com/shunting314
Co-authored-by: Yueming Hao <yhao@meta.com>
Summary:
Pull the big nested function out of the middle of cached_autotune() into its own class.
Also refactor creating the autotune cache itself out - which gets shared in the next diff.
Test Plan: unit tests
Differential Revision: D60677501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134911
Approved by: https://github.com/oulgen
The reraise is not supported and so this just gunks up our actual exception handling. You can trigger this by hitting an exception inside of an NN module that has hooks on it. You end up graph breaking on the reraise here, and losing the inner stack trace from the actual exception that was raised.
This might be kind of controversial. An alternate strategy is to support reraises in Dynamo or something but IDK this doesn't feel like the right place to apply force.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133239
Approved by: https://github.com/anijain2305
Summary:
The existing RemoteCacheBackend classes were a bit haphazard - some of them accepted bytes only, some accepted objects, some returned different types of objects than were passed in.
Update them to be more consistent:
1. RemoteCacheBackend is an implementation of a backend: Redis, Memcache, Manifold, LocalFile
2. RemoteCacheSerde is an implementation of a serde protocol - to turn structured objects (dict, list, etc) into bytes: RemoteCacheJsonSerde (json encoding), RemoteCachePassthroughSerde (strictly bytes only)
3. RemoteCache is the cache implementation itself, mixing a RemoteCacheBackend along with an RemoteCacheSerde to provide structured caching.
Other than simply reorganizing the existing cache code this also fixes the Redis autotune caching for OSS.
Test Plan: unit tests
Reviewed By: oulgen
Differential Revision: D61178859
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134032
Approved by: https://github.com/oulgen, https://github.com/bhack
Context: Adding support for the beta parameters to be tensors
Details: Similarly to the previous two PRs, addcmul_ is used with the tensor betas as the value argument. When this occurs, an item() call is invoked in the aten op. To avoid this graph break, addcmul_ is decomposed into its constituent ops.
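A small sketch of the identity behind the rewrite (illustrative per-tensor code, not the actual decomposition used by the optimizer):
```python
import torch

acc, t1, t2 = torch.zeros(3), torch.randn(3), torch.randn(3)
beta = torch.tensor(0.9)  # beta as a 0-dim tensor

expected = acc.clone().addcmul_(t1, t2, value=beta.item())  # eager path extracts the scalar
acc.add_(t1 * t2 * beta)                                    # decomposed form, no .item() on beta
torch.testing.assert_close(acc, expected)
```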
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134168
Approved by: https://github.com/anijain2305
ghstack dependencies: #134166, #134167
Context: Adding support for the beta parameters to be tensors
Details:
In this PR, similarly to the previous one, foreach_pow calls item() on the first argument when it is a scalar tensor. In this case, we broadcast that scalar tensor into a list of aliases of that tensor to avoid the item() call, and this results in a device copy of the scalar tensor. Once again, I don't think we can change the foreach_pow API due to BC concerns, so this op rewrite allows us to avoid a graph break, generate semantically the same code, and not affect eager.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134167
Approved by: https://github.com/anijain2305
ghstack dependencies: #134166
Context: Adding support for the beta parameters to be tensors
Details:
In order to add support for the beta params to be tensors without graph breaks in the Adam family of optimizers it is necessary to support foreach_lerp(x, y, s) where s is a scalar tensor. Today, this isn't possible because when `s` is a scalar, internally the aten op calls item() on it to extract the value and distribute it to each of the ops on the individual list indices. To support this in dynamo without graph breaks, I decompose the lerp into its constituent ops which support a scalar tensor in the list argument positions which do not result in an item() call. To be clear the item() call is more performant for eager I think and for BC I don't think we can modify that API, so this allows us to have performance in eager and no graph breaks in compile.
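A small sketch of the algebraic identity used for the rewrite, shown with per-tensor ops rather than the actual foreach kernels:
```python
import torch

xs = [torch.randn(4), torch.randn(4)]
ys = [torch.randn(4), torch.randn(4)]
w = torch.tensor(0.5)  # scalar tensor weight

expected = torch._foreach_lerp(xs, ys, w.item())        # eager path extracts the scalar
decomposed = [x + w * (y - x) for x, y in zip(xs, ys)]  # lerp(x, y, w) == x + w * (y - x)
for e, d in zip(expected, decomposed):
    torch.testing.assert_close(e, d)
```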
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134166
Approved by: https://github.com/anijain2305
This essentially undoes the large skips applied to nn.modules tests on everything but MacOS Sequoia by https://github.com/pytorch/pytorch/pull/128393
Instead it uses existing `xfail`, but guards it on `_macos15_or_newer` boolean
Before the change if run on MacOS 14:
```
% python3 ../test/test_modules.py -v -k Hardswish 2>&1|tail -n3
Ran 57 tests in 0.053s
OK (skipped=32)
```
After
```
% python3 ../test/test_modules.py -v -k Hardswish 2>&1|tail -n3
Ran 57 tests in 0.229s
OK (skipped=10, expected failures=2)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134858
Approved by: https://github.com/janeyx99
Add to the relative path search in the benchmark. This enables the user to run `torchbench.py` inside the `pytorch/benchmark/dynamo` folder when the `torchbench` repo is cloned at the same level as `pytorch`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134871
Approved by: https://github.com/FindHao
Currently, if installed, `onnxruntime` will be imported when importing `torch._inductor` (which will be imported by some other library, e.g. transformer-engine):
```
/mnt/c.py(53)<module>()
-> from torch._inductor.utils import maybe_profile
/usr/local/lib/python3.10/site-packages/torch/_inductor/utils.py(49)<module>()
-> import torch._export
/usr/local/lib/python3.10/site-packages/torch/_export/__init__.py(25)<module>()
-> import torch._dynamo
/usr/local/lib/python3.10/site-packages/torch/_dynamo/__init__.py(2)<module>()
-> from . import convert_frame, eval_frame, resume_execution
/usr/local/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py(48)<module>()
-> from . import config, exc, trace_rules
/usr/local/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py(52)<module>()
-> from .variables import (
/usr/local/lib/python3.10/site-packages/torch/_dynamo/variables/__init__.py(38)<module>()
-> from .higher_order_ops import (
/usr/local/lib/python3.10/site-packages/torch/_dynamo/variables/higher_order_ops.py(14)<module>()
-> import torch.onnx.operators
/usr/local/lib/python3.10/site-packages/torch/onnx/__init__.py(62)<module>()
-> from ._internal.onnxruntime import (
/usr/local/lib/python3.10/site-packages/torch/onnx/_internal/onnxruntime.py(37)<module>()
-> import onnxruntime # type: ignore[import]
```
This issue breaks the generated triton kernel because it imports torch, as well as unexpected runtime libraries.
I've also added a test for this specific case under `test/onnx`, perhaps we should add more somewhere else?
Related issue: https://github.com/huggingface/accelerate/pull/3056
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134662
Approved by: https://github.com/justinchuby
This PR add dynamic shapes support to foreach and combo kernels for horizontal fusion.
A flag `combo_kernel_foreach_dynamic_shapes` (default False, to avoid disturbing production workflows) is added to _inductor/config.py. Setting it to True enables automatic dynamic shapes for foreach kernels. It is always enabled for combo kernel cases. Added unit test cases.
This PR also fixes a flaky test case for [T198833257](https://www.internalfb.com/intern/tasks/?t=198833257)
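A hedged sketch of flipping the new flag; the config name comes from this PR, while the toy foreach function is illustrative:
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.combo_kernel_foreach_dynamic_shapes = True  # default is False

@torch.compile(dynamic=True)
def scale_all(xs):
    return torch._foreach_mul(xs, 2.0)

out = scale_all([torch.randn(8), torch.randn(16)])
```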
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134477
Approved by: https://github.com/mlazos
The caching autotuner caches triton configs, and it doesn't try to hash or save the pre_hook from the config if it exists. If we had a config that had a pre_hook, then we might autotune -> save the config (without the pre_config) -> later load the saved config and try to run it, but this time without the pre_hook.
So this PR adds an assert and deletes the pre_hook handling. We can be confident that we didn't have functional pre_hooks, because the pre_hook handling tries to use `self.arg_name`, which doesn't exist.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134633
Approved by: https://github.com/shunting314, https://github.com/jansel
Summary: When we are placing nodes in the graph, we should also replace the references in module_call_graph.
Test Plan:
buck2 run 'fbcode//mode/opt' torchrec/fb/ir/tests:test_serializer -- --filter-regex test_serialize_deserialize_vlea
buck2 test 'fbcode//mode/opt' fbcode//torchrec/fb/ir/tests:test_serializer -- --exact 'torchrec/fb/ir/tests:test_serializer - torchrec.fb.ir.tests.test_serializer.TestSerializer: test_serialize_empty_value_vlea' --run-disabled
buck2 test 'fbcode//mode/opt' fbcode//torchrec/fb/ir/tests:test_serializer -- --exact 'torchrec/fb/ir/tests:test_serializer - torchrec.fb.ir.tests.test_serializer.TestSerializer: test_deserialized_device_vle' --run-disabled
Differential Revision: D62014035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134830
Approved by: https://github.com/angelayi
This is part of a series of PRs to improve the functionality of `associative_scan`. This specific PR introduces a `combine_mode`, which can be either `pointwise` (default) or `generic`. In the `generic` case, `associative_scan` is more flexible and also allows non-pointwise functions. This PR has been derived from https://github.com/pytorch/pytorch/pull/129307.
@ydwu4 @Chillee @zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133012
Approved by: https://github.com/ydwu4
TLDR: this PR supports exporting cond with the inline_inbuilt_nn_modules flag by inlining into tracing code in proxy_tensor.py and _symbolic_trace.py (internally, the pattern is make_fx(record_module_stack)(torch.compile(f))).
We have two special treatments for the following cases:
1. _ModuleStackTracer will wrap all the nn modules into _AttrProxy. This _AttrProxy has several subtleties which make it hard to inline in dynamo, like overriding _modules with a property method and overriding `__getattr__`, which mutates captured states when called.
The solution is to unwrap the _AttrProxy and get its corresponding nn_module (a 1-1 correspondence), so that dynamo symbolically traces the original nn module instead of the _AttrProxy.
2. The tracer applies a bunch of patches to the `__getattr__` and `__call__` of nn.Module for tracking reasons. This doesn't work well with dynamo. The immediate error we see is `torch._dynamo.exc.Unsupported: 'inline in skipfiles: WeakKeyDictionary.__contains__ | __contains__ /home/yidi/.conda/envs/pytorch/lib/python3.10/weakref.py` caused by a weakdict in PythonKeyTracer.
The solution is to temporarily remove the patches during dynamo symbolic convert, so that dynamo has a clean environment. make_fx will then trace the transformed bytecode of dynamo and patch the nn modules there instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133731
Approved by: https://github.com/anijain2305
ghstack dependencies: #134775
Fixes #131865. Addresses the issue seen when running the llama v3.1 8B parameter model on the MPS backend, where the batch matmul output size can go over the 32-bit indexing limit of MPS tensors, causing an assert.
Test case to reproduce the issue with the dimensions encountered in llama v3.1 and verify this fix works around it:
```
import torch
device='mps'
a = torch.randn([32, 20064, 128], dtype=torch.float32,device=device)
b = torch.randn([32, 128, 20064], dtype=torch.float32, device=device)
res = torch.bmm(a, b)
```
Notably, the current change only works as long as the individual output matrix in the bmm does not exceed 2**32 elements; this lets us split up the computation along the batch axis to avoid going over the limit.
Added a TORCH_CHECK to raise an error if the individual matrix dimensions are too large to handle for this op until a more general workaround tiling the matmuls is available.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133430
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Fixes issue seen in https://github.com/pytorch/pytorch/issues/132872#issuecomment-2314574656
With this API, we can mark the offending module as static in detectron2.
Today's world - Consider user defined nn module int attributes automatic dynamic. Use the API in this PR to make them static if you want.
Alternative work - Consider all int attributes of any user defined nn module class static. And then introduce an API - `torch._dynamo.mark_nn_module_attribute_dynamic`. The default being static is worrying if users have `counter` in their model which is updated in each forward invocation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134713
Approved by: https://github.com/jansel
ghstack dependencies: #134653
## Motivation
This is a follow-up to PR https://github.com/pytorch/pytorch/pull/126970, adding the facility to run the content on Intel Gaudi devices.
We intend to extend similar generalization for the rest of the content in test/dynamo which is currently being written to work specifically for cuda devices. Other devices can add onto it if support is available.
## Changes
- Carve out bert-related content into another class
- Use the instantiate_device_type utility to instantiate this class for devices which support the functionality
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130714
Approved by: https://github.com/anijain2305
Benchmarks several shapes of basic nn modules, in both eager and inductor.
```
collecting compile time instruction count for basic_modules_ListOfLinears_inductor
compile time instruction count for iteration 0 is 48602516013
compile time instruction count for iteration 1 is 20424350269
compile time instruction count for iteration 2 is 20440350455
compile time instruction count for iteration 3 is 20419269999
compile time instruction count for iteration 4 is 20430782200
compile time instruction count for iteration 5 is 20455049622
compile time instruction count for iteration 6 is 20157290712
compile time instruction count for iteration 7 is 20455324001
compile time instruction count for iteration 8 is 20450158317
compile time instruction count for iteration 9 is 20492987748
collecting compile time instruction count for basic_modules_ListOfLinears_eager
compile time instruction count for iteration 0 is 961328334
compile time instruction count for iteration 1 is 958887896
compile time instruction count for iteration 2 is 958792214
compile time instruction count for iteration 3 is 958375977
compile time instruction count for iteration 4 is 958568525
compile time instruction count for iteration 5 is 958152305
compile time instruction count for iteration 6 is 959322800
compile time instruction count for iteration 7 is 958332703
compile time instruction count for iteration 8 is 958092100
compile time instruction count for iteration 9 is 958095277
collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_inductor
compile time instruction count for iteration 0 is 3572145793
compile time instruction count for iteration 1 is 3503323973
compile time instruction count for iteration 2 is 3501962432
compile time instruction count for iteration 3 is 3501746084
compile time instruction count for iteration 4 is 3500687361
compile time instruction count for iteration 5 is 3822254676
compile time instruction count for iteration 6 is 3498356846
compile time instruction count for iteration 7 is 3499019157
compile time instruction count for iteration 8 is 3500780314
compile time instruction count for iteration 9 is 3500257458
collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_eager
compile time instruction count for iteration 0 is 1844838754
compile time instruction count for iteration 1 is 1843476862
compile time instruction count for iteration 2 is 1844761450
compile time instruction count for iteration 3 is 1845371742
compile time instruction count for iteration 4 is 1845159665
compile time instruction count for iteration 5 is 1845035802
compile time instruction count for iteration 6 is 1844895007
compile time instruction count for iteration 7 is 1844697922
compile time instruction count for iteration 8 is 1844780885
compile time instruction count for iteration 9 is 1844493990
collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_inductor
compile time instruction count for iteration 0 is 1597839479
compile time instruction count for iteration 1 is 1348225351
compile time instruction count for iteration 2 is 1347340818
compile time instruction count for iteration 3 is 1348170800
compile time instruction count for iteration 4 is 1348637747
compile time instruction count for iteration 5 is 1678366444
compile time instruction count for iteration 6 is 1348412420
compile time instruction count for iteration 7 is 1348461578
compile time instruction count for iteration 8 is 1347420149
compile time instruction count for iteration 9 is 1349748195
collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_eager
compile time instruction count for iteration 0 is 137721777
compile time instruction count for iteration 1 is 139065517
compile time instruction count for iteration 2 is 137130552
compile time instruction count for iteration 3 is 137506030
compile time instruction count for iteration 4 is 137089838
compile time instruction count for iteration 5 is 137477395
compile time instruction count for iteration 6 is 138550452
compile time instruction count for iteration 7 is 137568409
compile time instruction count for iteration 8 is 136968468
compile time instruction count for iteration 9 is 137481664
collecting compile time instruction count for basic_modules_ModuleComparison_inductor
compile time instruction count for iteration 0 is 917209684
compile time instruction count for iteration 1 is 899154426
compile time instruction count for iteration 2 is 898145079
compile time instruction count for iteration 3 is 899817018
compile time instruction count for iteration 4 is 899184687
compile time instruction count for iteration 5 is 898172885
compile time instruction count for iteration 6 is 899958951
compile time instruction count for iteration 7 is 899348186
compile time instruction count for iteration 8 is 897745404
compile time instruction count for iteration 9 is 899581123
collecting compile time instruction count for basic_modules_ModuleComparison_eager
compile time instruction count for iteration 0 is 113165302
compile time instruction count for iteration 1 is 112724376
compile time instruction count for iteration 2 is 112774611
compile time instruction count for iteration 3 is 114465211
compile time instruction count for iteration 4 is 112689572
compile time instruction count for iteration 5 is 112726465
compile time instruction count for iteration 6 is 112853691
compile time instruction count for iteration 7 is 112295238
compile time instruction count for iteration 8 is 114022136
compile time instruction count for iteration 9 is 112664932
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134658
Approved by: https://github.com/anijain2305
ghstack dependencies: #133834, #134635, #134649, #134652
**Summary**
reland of https://github.com/pytorch/pytorch/pull/134294
Fixes #131446, fixes #126852, fixes #126868, fixes #126493
The PR was reverted due to a CI red signal in https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658. It seems that the `gaussian_nll_loss` test had been flaky before my original PR #134294. Therefore this PR also removes the `xfail` mark on this specific test to make the CI signal green.
See the error message below:
```
2024-08-24T13:42:01.3228990Z ==================================== RERUNS ====================================
2024-08-24T13:42:01.3229530Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3229710Z Unexpected success
2024-08-24T13:42:01.3230235Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3230407Z Unexpected success
2024-08-24T13:42:01.3230594Z =================================== FAILURES ===================================
2024-08-24T13:42:01.3231128Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3231296Z Unexpected success
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134509
Approved by: https://github.com/tianyu-l, https://github.com/wz337
# Motivation
If building XPU via oneAPI 2024.2, the build fails because `sycl-preview.lib` exists on Windows, and linking the unexpected lib results in `error LNK2019: unresolved external symbol`.
# Solution
Explicitly use `sycl-preview` in the Linux build only.
# Additional Context
For `find_library`, please note that the variable will not be updated if it has been stored.
```
If the library is found the result is stored in the variable and the search will not be repeated unless the variable is cleared.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133845
Approved by: https://github.com/min-jean-cho, https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet
**Summary**
Fix the comment: https://github.com/pytorch/pytorch/pull/122961#issuecomment-2313930242. For all of the cases we see in the 3 test suites (TorchBench, Timms, Huggingface) we expect:
* `_node` is a FX Node with target in ["index_expr", "load", "store"]
* `_node.args[1 if _node.target == "index_expr" else 2]` is another FX node with target `get_index`
* `_node.args[1 if _node.target == "index_expr" else 2].args[0]` is a str for the name of this index expression
This turns out not to be true in some FB-internal test cases, per the failure log posted in the link above. So, add a condition check to work around it (a sketch of the check is below).
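A hedged sketch of that condition check (function and variable names are illustrative, not the exact inductor code):
```python
import torch.fx

def get_index_expr_name(_node: torch.fx.Node):
    # Expected pattern: args[1] (for "index_expr") or args[2] (for "load"/"store")
    # is a `get_index` FX node whose first arg is the index-expression name.
    arg_idx = 1 if _node.target == "index_expr" else 2
    if arg_idx >= len(_node.args):
        return None
    index_node = _node.args[arg_idx]
    if not (isinstance(index_node, torch.fx.Node) and index_node.target == "get_index"):
        return None  # unexpected graph shape: skip instead of failing
    name = index_node.args[0] if index_node.args else None
    return name if isinstance(name, str) else None
```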
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134645
Approved by: https://github.com/jgong5, https://github.com/masnesral
Summary:
We found that if we init the PG in a background thread, it blocks the main thread until init is complete. This is because in the pybinding we never release the GIL.
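A sketch of the affected usage pattern (a minimal single-process example, not the actual test); with the GIL held inside the binding, the main thread below would stall until init finishes:
```python
import threading
import torch.distributed as dist

def init_pg():
    # Single-process gloo group, just to illustrate init running off the main thread.
    dist.init_process_group(
        backend="gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1
    )

t = threading.Thread(target=init_pg)
t.start()
# ... the main thread should stay free to run Python work here ...
t.join()
dist.destroy_process_group()
```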
Test Plan:
existing CI on eager init
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134779
Approved by: https://github.com/c-p-i-o
This benchmark measures the cost of compiling the following function in eager and inductor; it's basically two benchmarks.
```
@torch.compile(backend=self.backend, fullgraph=True)
def f(a, b):
    result = a.clone()
    for i in range(1000):
        if i % 3 == 0:
            result = result + b
        elif i % 3 == 1:
            result = result + 8 * b
        else:
            result = result.sin()
    return result
```
PYTHONPATH=$(pwd) python benchmarks/add_loop.py out
```
collecting compile time instruction count for add_loop_eager
compile time instruction count for iteration 0 is 8286649663
compile time instruction count for iteration 1 is 2838971338
compile time instruction count for iteration 2 is 2834263023
compile time instruction count for iteration 3 is 2829447493
compile time instruction count for iteration 4 is 2830904231
compile time instruction count for iteration 5 is 2830281077
compile time instruction count for iteration 6 is 2831466595
compile time instruction count for iteration 7 is 2830732164
compile time instruction count for iteration 8 is 2831088056
compile time instruction count for iteration 9 is 2831204407
collecting compile time instruction count for add_loop_inductor
compile time instruction count for iteration 0 is 32585687849
compile time instruction count for iteration 1 is 11747553436
compile time instruction count for iteration 2 is 11746959875
compile time instruction count for iteration 3 is 11749479461
compile time instruction count for iteration 4 is 11750053711
compile time instruction count for iteration 5 is 11750793958
compile time instruction count for iteration 6 is 11751673576
compile time instruction count for iteration 7 is 11754552912
compile time instruction count for iteration 8 is 11753723127
compile time instruction count for iteration 9 is 11759059942
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134652
Approved by: https://github.com/anijain2305
ghstack dependencies: #133834, #134635, #134649
We introduced the dispatchable backend for a ProcessGroup and collective in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up that cleans up the options of a ProcessGroup and asks users to either set the timeout or backend later on, or directly create the backend after creating a PG.
Also, PGNCCL is using the option class from ProcessGroup, but we actually should use the Option from the backend class. So this PR makes the type and name aligned with what we are doing on the cpp side. I don't change the signature of the public API, so it still uses args named "pg_options".
We need to change the tests to align with this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132931
Approved by: https://github.com/H-Huang
Restarts the work from PR https://github.com/pytorch/pytorch/pull/100331 in this new PR since it's hard to rebase. Some code is expected to be copy/pasted from the previous PR, and the main idea is the same.
Previously we saw a relatively large compilation-time increase due to too many loop orders being considered. This PR continues the work by pruning and only considering loop orders that we know for sure are relevant (i.e., doing it on demand).
Some manually created cases where loop ordering matters are added as unit tests. The PR makes sure inductor does not miss fusion opportunities for them.
This PR should solve the unable-to-fuse problem in https://github.com/pytorch/pytorch/issues/130015
Right now there is still a significant increase in compilation time. I'll disable the feature by default. Later on, after the compilation-time issue is resolved, I'll enable it by default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126254
Approved by: https://github.com/jansel
Previously setting garbage_collection_threshold or max_split_size_mb along with expandable_segments:True could cause the allocator to hit assert failures when running nearly out of memory. This PR ensures garbage_collection and max_split freeing do not accidentally try to release expandable segments.
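For context, a sketch of the allocator configuration combination in question, using the documented `PYTORCH_CUDA_ALLOC_CONF` options:
```python
import os

# Must be set before the first CUDA allocation (ideally before importing torch).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "expandable_segments:True,garbage_collection_threshold:0.8,max_split_size_mb:128"
)

import torch  # noqa: E402
```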
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134338
Approved by: https://github.com/ezyang
Fixes #133252
In strict mode, we have this routine for mapping traced parameters to their FQNs using tensor ids. Currently we assume there's at least 1 unique FQN for each traced parameter, but this seems to break with parameter reuse when call_module nodes are present. Adding a test case where this breaks.
Fixes this by assigning the same FQN to all traced parameters with the same tensor id. This is fine because we return the original state_dict for the EP, and the unflattener has its own routine of handling aliasing: https://github.com/pytorch/pytorch/pull/125758
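A minimal sketch of the parameter-reuse pattern that used to break the FQN mapping (module and names are illustrative, not the test added here):
```python
import torch
import torch.nn as nn

class Shared(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(4, 4)
        self.b = nn.Linear(4, 4)
        self.b.weight = self.a.weight  # one tensor id, two FQNs: a.weight and b.weight

    def forward(self, x):
        return self.b(self.a(x))

ep = torch.export.export(Shared(), (torch.randn(2, 4),))
print(ep.graph_signature.parameters)
```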
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134500
Approved by: https://github.com/angelayi
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA facing the NaN checker in issue https://github.com/pytorch/pytorch/issues/134062.
With this fix, `torch.cuda.set_device(device)` is not needed to work around the IMA.
Also refactored a couple places where the guard is created -- preferably we create the guard with a known device, rather than setting the device later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134357
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #134345
Fix the `test_logs_out` UT on Windows; make all UTs in `test/dynamo/test_logging.py` pass on Windows.
Changes:
1. Close the `NamedTemporaryFile` to release the file handle and avoid the PermissionError issue.
2. Set up the temp file with `delete=False` so the file is not auto-deleted.
3. Open the log file as "utf-8" to align with Linux.
4. Handle the process-wrapping difference on Windows.
5. Delete the tmp file manually (see the sketch below).
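A sketch of the Windows-friendly temp-file handling described in the list above (not the exact test code):
```python
import os
import tempfile

# delete=False: the file is not removed on close, so it can be reopened on Windows.
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".log", delete=False, encoding="utf-8")
try:
    tmp.close()  # release the handle so a second open() doesn't hit PermissionError
    with open(tmp.name, encoding="utf-8") as f:  # utf-8 to align with Linux
        contents = f.read()
finally:
    os.remove(tmp.name)  # delete=False means we clean up manually
```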
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134586
Approved by: https://github.com/jansel
Summary:
With training IR, we cannot rely on trapping `to()` in `FunctionalTensor` because the regular decomposition kicks in first, and that can cause it to be optimized away.
So instead we preserve it until we functionalize, and then replace it explicitly with `_to_copy()`.
Test Plan: expected test failures go away
Differential Revision: D61883878
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134622
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
aten.empty is almost always fusible into its consumer, so we never CSE
it. This fixes a bug that looks like the following:
```py
@torch.library.custom_op("_reinplacing::sin_cos", mutates_args={"out_sin", "out_cos"})
def sin_cos(x: torch.Tensor, out_sin: torch.Tensor, out_cos: torch.Tensor) -> None:
    out_sin.copy_(x.sin())
    out_cos.copy_(x.cos())

@torch.compile
def f(x):
    out0 = torch.empty_like(x)
    out1 = torch.empty_like(x)
    sin_cos(x, out0, out1)
    return x.clone(), out0, out1

x = torch.randn(3, requires_grad=True)
f(x)
```
- cse would de-duplicate the empty nodes
- reinplacing would add an additional clone (because it can't write to
both tensors at the same time)
- the clone lowers into a new buffer + a copy_ kernel
- the copy_ kernel is unnecessary because "empty" is special - all reinplacing needed was an additional
buffer, it doesn't matter what the values are.
We could attempt to fix this on the reinplacing side but this seemed
better as a partitioner heuristic and the reinplacing fix is a bit more
tricky (we'd need to identify that the op never reads from the empty
node).
Test Plan:
- new test (the old number was 27, the new number is 21, so this PR
helped).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134703
Approved by: https://github.com/yf225
ghstack dependencies: #134466, #134490, #134491
Fixes [134212](https://github.com/pytorch/pytorch/issues/134212)
Currently, when we use 2D FSDP with TP, `optimizer.step()` fails if the model is not fully tensor-parallelized. If we don't have the entire model tensor-parallelized when doing 2D, we have both 1D and 2D DTensor parameters. As foreach is turned on by default, `optimizer.step()` fails because cross-mesh ops are not allowed. The error is as follows:
```
NotImplementedError: aten._foreach_mul_.Scalar: DTensor does not support cross-mesh operation yet!Got meshes: DeviceMesh('cuda', [[0, 1], [2, 3]], mesh_dim_names=('dp', 'tp')) DeviceMesh('cuda', [1, 3], mesh_dim_names=('dp',))
```
In this PR, we extend implicit_replication to replicate DTensor in missing dimensions for foreach ops. If users don't want to fully tensor parallelize the model when using 2D, they have the option of using the `implicit_replication()` context manager for `optimizer.step()`. In this case, we would swap out the 1D DTensorSpec and replace it with 2D DTensorSpec. However, we don't want to turn this on by default yet, as we want the users to be aware that the tp dimension is replicated if a layer is not tp-ed.
With implicit replication turned on, replicating the DTensor spec in the missing dimension works for most foreach cases, except when the first DTensor in the list is one that also needs to be replicated. This is currently a limitation for which I don't have a good solution yet. With this change, we can handle most cases except the one where the first DTensor's ndim is not the largest.
```
[2D_DTensor, 1D_DTensor...] ---> Implicit_replication() can handle this.
[1D_DTensor, 2D_DTensor...] ---> Implicit_replication() can't handle this.
```
This change doesn't affect the existing default behavior, as `implicit_replication()` is not turned on by default.
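A usage sketch (not code from this PR); the import path below is DTensor's experimental API and may move in future releases:
```python
import torch
from torch.distributed._tensor.experimental import implicit_replication

def step_with_implicit_replication(optimizer: torch.optim.Optimizer) -> None:
    # Opt in to replicating the missing mesh dimension of 1D DTensor params so
    # the foreach optimizer ops don't hit the cross-mesh error shown above.
    with implicit_replication():
        optimizer.step()
        optimizer.zero_grad()
```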
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134551
Approved by: https://github.com/tianyu-l
So that the tensor's lifetime management is the same as the management built for the NCCL, pre and post kernels.
Also so that on visualizers they show up in the NCCL stream line; otherwise, if they show up in the compute line, users may get confused ("my code does not have these kernels").
The check is thus moved after the point where we make the NCCL stream depend on the last compute kernel.
Also moved declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu.
Differential Revision: [D61957573](https://our.internmc.facebook.com/intern/diff/D61957573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134300
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
The previous PR https://github.com/pytorch/pytorch/pull/133532 caused stuck compilation issue on internal models. In this 2nd attempt PR, we gate the trace_rules.py changes with `if not torch._dynamo.config.skip_fsdp_hooks:`, so that they don't take effect for current graph-break FSDP2 (which relies on the default config value `skip_fsdp_hooks=True`), and will only take effect when we are using Traceable FSDP2 (in which case the user needs to proactively set `skip_fsdp_hooks=False`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134539
Approved by: https://github.com/ckluk2, https://github.com/yanboliang
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
There are two function variants to get accumulated dtype for a given dtype:
- Func1: `c10::ScalarType toAccumulateType(c10::ScalarType type, c10::DeviceType device)`
- Func2: `c10::ScalarType toAccumulateType(c10::ScalarType type, bool is_cuda)`
Func1 is general enough to support different devices, while Func2 only supports CUDA and CPU. This PR adds the Intel GPU path to Func1, and we expect users to invoke Func1 to ensure compatibility across different devices.
* __->__ #134465
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134465
Approved by: https://github.com/Skylion007, https://github.com/atalman
## Semantic
The semantic is
(1) By default `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint).
```python
import torch
import torch.nn as nn
sd = nn.Linear(3, 5).state_dict()
with torch.serialization.skip_data():
    torch.save(sd, 'foo.pt')
print(torch.load('foo.pt', weights_only=True))
```
(2) With `torch.serialization.skip_data(materialize_fake_tensors=True)`, if a FakeTensor is passed to `torch.save`, the pickler will treat these FakeTensors as "materialized": space will be reserved in the checkpoint for the associated storage bytes, and when loading, the type will be Tensor instead of FakeTensor.
```python
import torch
import torch.nn as nn
from torch._subclasses.fake_tensor import FakeTensorMode
with FakeTensorMode():
    m = nn.Linear(3, 5, dtype=torch.float16, device='cuda')
    sd = m.state_dict()
    with torch.serialization.skip_data(materialize_fake_tensors=True):
        torch.save(sd, 'bla.pt')
print(torch.load('bla.pt', weights_only=True))
# OrderedDict([('weight', tensor([[0., 0., 0.],
# [0., 0., 0.],
# [0., 0., 0.],
# [0., 0., 0.],
# [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))])
```
## Follow Ups
- [ ] `torch.load` semantic for skip_data context manager
- [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504
Approved by: https://github.com/albanD
Summary:
Encountered issues related to AMD build when working on https://www.internalfb.com/diff/D60739324?dst_version_fbid=2203158110057105 (see stack trace P1545717562)
Looking at the file history, it seems that the flag is no longer used, so I propose to remove it. Alternatively, I could change the `#ifdef` to check both `USE_C10D_NCCL` and `USE_ROCM` and include the corresponding AMD header files.
Let me know what is more preferred way.
Test Plan: Sandcastle
Differential Revision: D61762129
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134404
Approved by: https://github.com/malfet
A user wants to use the flop counter with meta devices. This previously caused problems for SDPA+NJT:
1. autocast check: `torch.is_autocast_enabled("meta")` fails because `meta` is not valid for autocasting. If we skip this, we run into the next error
2. math backend: conversion to NST requires getting concrete offsets in a list of python integers, which doesn't work on a meta tensor b2eb0e8c6a/torch/nested/_internal/sdpa.py (L809-L815)
3. (fixed in the previous PR, #134288) - if we force using flash attention backend for flop counting, `_flash_attention_forward` previously didn't support meta tensors.
In this PR, we check specifically for FlopCounterMode, and, if it's enabled and combined with meta tensors, (a) skip autocasting and (b) force it down the flash attention path. This isn't generally safe for tracing (e.g. if you actually care which kernels you are running), but in the absence of actual device information, we have to make some assumptions. By specifically checking for FlopCounterMode, this should reduce the chance of unintended side effects for other meta tensor users.
Note: fake tensor would solve a bunch of these issues, but it's not a viable solution right now for the user.
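A simplified sketch of the meta-device flop-counting use case (dense tensors for brevity; the PR specifically targets the SDPA+NJT path, which follows the same pattern):
```python
import torch
from torch.nn.functional import scaled_dot_product_attention as sdpa
from torch.utils.flop_counter import FlopCounterMode

q = torch.randn(2, 8, 128, 64, device="meta", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="meta", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="meta", dtype=torch.float16)

with FlopCounterMode(display=False) as flop_counter:
    sdpa(q, k, v)
print(flop_counter.get_total_flops())
```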
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134289
Approved by: https://github.com/soulitzer
ghstack dependencies: #134288
Fixes #130394
TorchInductor doesn't respect the original strides of outputs, which opens up optimization opportunities like changing the memory layout. But in some cases, such as the one in https://github.com/pytorch/pytorch/issues/130394, we do need the output to match the exact strides required. Correctness is the first-priority goal. So, this PR adds a new API `ir.ExternKernel.require_exact_strides(x, exact_strides, allow_padding=False)` to fix the issue. This PR makes dense and non-dense outputs' strides follow the strides required by the semantics.
The comparison between the original output and the output after this fix for the test is below.
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 128
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 8
    x1 = (xindex // 8)
-   x2 = xindex
    tmp0 = tl.load(in_ptr0 + (x0 + (16*x1)), xmask)
    tmp1 = tmp0 + tmp0
-   tl.store(out_ptr0 + (x2), tmp1, xmask)
+   tl.store(out_ptr0 + (x0 + (16*x1)), tmp1, xmask)

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (16, 8), (16, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
-       buf1 = empty_strided_cuda((16, 8), (8, 1), torch.float32)
+       buf1 = empty_strided_cuda((16, 8), (16, 1), torch.float32)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_copy_0.run(arg0_1, buf1, 128, grid=grid(128), stream=stream0)
        del arg0_1
    return (buf1, )
```
buf1 is created with the exact strides required by the user, and its values are written with the same strides as the input.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130956
Approved by: https://github.com/eellison, https://github.com/blaine-rister, https://github.com/desertfire
```
compile time instruction count for iteration 1 is 10732129038
compile time instruction count for iteration 2 is 10719776783
compile time instruction count for iteration 3 is 10729546868
compile time instruction count for iteration 4 is 10737655132
compile time instruction count for iteration 5 is 10732564252
compile time instruction count for iteration 6 is 10728721234
compile time instruction count for iteration 7 is 10733354271
compile time instruction count for iteration 8 is 10719588972
compile time instruction count for iteration 9 is 10706311856
```
1. Add torch.manual_seed(0); inputs were not the same across iterations.
2. Disable gc.
3. Remove the loop (not needed since compilation happens only once).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134649
Approved by: https://github.com/aorenste
ghstack dependencies: #133834, #134635
Summary: The default c_shim version was switched to 2 for HIP in D60674018. This results in some linking errors where shim function symbols are missing from the compiled .so file (eg. P1551186492) when building lowering benchmark scripts since the required files aren't included. Hipify the shim v2 generated header files as well since they're needed during codegen when the buck binaries are executed.
Reviewed By: frank-wei, zoranzhao, henryoier
Differential Revision: D61865202
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134689
Approved by: https://github.com/zoranzhao
Summary:
This is to fix the pytorch issue filed https://github.com/pytorch/pytorch/issues/133010
One way to fix this problem is to enable starting processes in parallel in mp.start_processes.
What else is in the diff:
Refactored the api_test test case, which was repeating a lot of tests due to inheritance.
Added a unit test for forkserver when parallel start is on.
Test Plan: Added unit tests
Differential Revision: D61878552
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134629
Approved by: https://github.com/d4l3k
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.
I tested that my change works when calling unfold and backward on a `MaskedTensor`, but I didn't find the tests for the dispatch of such an operation. Where are they?
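A quick sketch of the behavior this enables (MaskedTensor is a prototype API, so details may differ):
```python
import torch
from torch.masked import masked_tensor

data = torch.arange(6.0)
mask = torch.tensor([True, False, True, True, False, True])
mt = masked_tensor(data, mask)
print(mt.unfold(0, 2, 2))  # now dispatches to the newly added unfold path
```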
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
- This PR generates a more useful output log for users: P1552399180.
- It also fixes the logic when we check the all-gather size mismatch.
- Add dtype check for collective input/output
- We store more context information for error match_state so that we can report them in the file.
- Disable the size match for alltoall because we don't log the size for all inputs/outputs.
- Correct some types for func args specification.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134528
Approved by: https://github.com/c-p-i-o
This PR adds a basic Runtime Estimator for single-device models.
It estimates the GPU runtime in milliseconds using various estimation methods under the ``FakeTensorMode``.
It provides a ``TorchDispatchMode`` based context manager that can estimate the eager runtime of PyTorch functions. It supports two estimation modes, benchmarking (`operator-level-benchmark`) and roofline cost modeling (`operator-level-cost-model`).
For modules executed under this context manager, it aggregates the forward and backward operation runtimes and records their execution orders.
```
import torch
from torch import nn, optim
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.distributed._tools.runtime_estimator import RuntimeEstimator
from torch.testing._internal.distributed._tensor.common_dtensor import (
    ModelArgs,
    Transformer,
)

if __name__ == "__main__":
    def _train_step(
        model: nn.Module,
        optimizer: optim.Optimizer,
        inp: torch.Tensor,
    ):
        out = model(inp)
        loss = out.sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dev = torch.cuda.current_device()
    vocab_size = 8192
    bsz, seq_len = 32, 1024
    model_args = ModelArgs(
        n_layers=4,
        n_heads=12,
        vocab_size=vocab_size,
        max_seq_len=seq_len,
        dim=768,
        dropout_p=0.1,
    )
    runtime_estimator = RuntimeEstimator()

    with FakeTensorMode():
        with torch.device(dev):
            model = Transformer(model_args)
        optimizer = optim.Adam(model.parameters(), lr=1e-2, foreach=True)
        inp = torch.randint(0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev)
        with runtime_estimator("operator-level-benchmark"):
            _train_step(model, optimizer, inp)
        with runtime_estimator("operator-level-cost-model"):
            _train_step(model, optimizer, inp)

    # Actual model runtime
    with torch.device(dev):
        model = Transformer(model_args)
    optimizer = optim.Adam(model.parameters(), lr=1e-2, foreach=True)
    inp = torch.randint(0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev)
    warmup_iters, actual_iters = 2, 5
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup_iters):
        _train_step(model, optimizer, inp)
    start_event.record()
    for _ in range(actual_iters):
        _train_step(model, optimizer, inp)
    end_event.record()
    torch.cuda.synchronize()
    measured_time = start_event.elapsed_time(end_event) / actual_iters
    print(f"Actual total_time: {measured_time:.3f} ms")
```
<img width="506" alt="Screenshot 2024-08-26 at 11 27 15 PM" src="https://github.com/user-attachments/assets/04d243c9-21a6-4389-8c20-80958980788c">
@weifengpy @xuanzhang816 @gnadathur
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134243
Approved by: https://github.com/weifengpy
The original DCP doesn't flatten all the containers, which can cause issues; https://github.com/pytorch/pytorch/pull/125335 intends to solve the issue by flattening all the dictionaries.
Unfortunately, it breaks checkpoints that were saved before 2.4. This also exposes some issues with DCP:
1. DCP should record version in the metadata.
2. DCP should have a nice way to load old state_dict.
3. DCP should unflatten all containers (map, list) not just map.
This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future.
@pradeepfn Please let me know if this summary matches our discussion.
Fixes https://github.com/pytorch/pytorch/issues/133923
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134158
Approved by: https://github.com/wz337, https://github.com/pradeepfn
Summary: benchmarks/dynamo/ci_expected_accuracy/update_expected.py expects a benchmark run config to be named {config}_{benchmark}, and CPU tests should follow the same naming convention.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134639
Approved by: https://github.com/huydhn
Summary: Recently https://github.com/pytorch/pytorch/pull/133620 added support for automatic dynamic shapes, where a new enum, `DIM`, was introduced to provide hints like `AUTO` and `STATIC`. This PR is a nominal change where we expose the hints via the existing public `Dim` API, and remove `DIM` from the public API. The main motivation is to avoid having users need to import too many things.
Test Plan: existing
Differential Revision: D61807361
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134484
Approved by: https://github.com/angelayi
Fixes #134391, #124714
The above issues reported that `dist.barrier()` could hang in some cases.
The culprit is that ProcessGroupNCCL inferred a wrong device to perform the dummy all-reduce.
After the PR, the following will be the order of device selection:
- 1st choice: `opts.device_ids`, if provided by user via `barrier(opts)`.
- 2nd choice: bound device id, if provided to `init_process_group` via the `device_id` arg (see the usage sketch below).
- 3rd choice: `usedDeviceIdxs_` recorded in current PG. Will have a value from previous collectives.
- 4th choice: `globalRank() % localDeviceCount_`. This can only happen when `dist.barrier()` is the first call of the PG.
What's new:
- Added the 2nd choice.
- In the 4th choice, we use `globalRank()` instead of group-local rank, because the group-local rank can be offset wrt the device id if intra-node GPUs are sharded into multiple dimensions.
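A usage sketch of the first two choices above (assumes a typical torchrun launch where LOCAL_RANK is set):
```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
# 2nd choice: bind a device id to the PG at init time.
dist.init_process_group("nccl", device_id=torch.device(f"cuda:{local_rank}"))
# 1st choice: pass the device id explicitly to the barrier itself.
dist.barrier(device_ids=[local_rank])
dist.destroy_process_group()
```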
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134617
Approved by: https://github.com/yifuwang, https://github.com/shuqiangzhang
This adds logs if we can't acquire locks in NCCLUtils and ProcessGroupNCCL for 30s.
This is motivated by some deadlocks we're seeing, and it's unclear whether they are in NCCL or on the PyTorch side of things.
This required replacing most `std::mutex` with `std::timed_mutex` and `std::condition_variable_any` as appropriate.
Test plan:
existing CI for regressions
will add unit tests on `C10D_LOCK_GUARD`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134131
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
Summary:
There are 2 concepts of unsupported sympy.Functions in symbolic_shapes:
1) unsupported by the export solver, meaning the solver doesn't know how to provide useful fixes for those functions
2) unsupported by the sympy interpreter - meaning we can't reify them into FX nodes because the functions aren't present in PythonReferenceAnalysis
This splits the current call into a call for each version, with the Export solver the only user of 1). For 1), we enumerate the functions in _sympy/functions.py, and subtract the functions we know we can support. For 2) there's only 3 functions we've seen pop up in test cases.
cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10
Differential Revision: D61863394
Pulled By: pianpwk
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134598
Approved by: https://github.com/angelayi
PYTHONPATH=$(pwd) python benchmarks/update_hint_benchmark.py out
As of this diff, compile_time_instruction_count counts the number of instructions from within convert_frame.compile_inner.
```
update_hint_regression,compile_time_instruction_count,10522459165
```
will add result from CI once populated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133834
Approved by: https://github.com/aorenste
Summary: apparently DIM.AUTO leads to duck sizing; I didn't catch this. This does the least intrusive fix possible by using `torch._dynamo.maybe_mark_dynamic()` under the hood.
Test Plan: added test
Differential Revision: D61809344
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134486
Approved by: https://github.com/avikchaudhuri
See #121528 for additional context.
In #120682, we moved the attention kernels from meta_registrations to fake_impls with the intent of fixing the device handling for seed/offset: these are typically on CPU. We needed to put the registrations in fake_impls to do this because meta_registrations doesn't have a way to specify device, whereas fake_impls does. But when we tried to actually fix the device types (#120839), we had to revert the PR because it broke cudagraph handling (during which seed/offset _are_ on CUDA).
Now, we want to put the registrations back in meta_registrations so that we can call these kernels with meta tensors. The use case is later in this stack - we want to be able to use the flop counter with these kernels.
Also - I specifically skip the `compare_tensor_meta()` check in test_fake / test_fake_autocast tests for the `_efficient_attention_forward` and `_flash_attention_forward` kernels, which fails because of the device mismatch from the seed/offset tensors. Then we can un-skip these opinfos. I verified that the efficient_attention_forward bug (#120842) is now caught by these opinfos if I revert the fix from this PR.
Differential Revision: [D61687369](https://our.internmc.facebook.com/intern/diff/D61687369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134288
Approved by: https://github.com/drisspg
Maintainers have the links to their GitHub profiles, but the major contributors do not have them.
I added the links to the contributors' GitHub accounts in case anyone wants to follow them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133787
Approved by: https://github.com/albanD
Seeing failures like this:
```
#49 844.6 //build_scripts/manylinux1-check.py:6: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
.....
[python 3/3] RUN bash build_scripts/build.sh && rm -r build_scripts:
846.9 ...it did, yay.
846.9 + for PYTHON in '/opt/python/*/bin/python'
846.9 + /opt/python/cpython-3.12.0/bin/python build_scripts/manylinux1-check.py
847.0 Traceback (most recent call last):
847.0 File "//build_scripts/manylinux1-check.py", line 55, in <module>
847.0 if is_manylinux1_compatible():
847.0 ^^^^^^^^^^^^^^^^^^^^^^^^^^
847.0 File "//build_scripts/manylinux1-check.py", line 6, in is_manylinux1_compatible
847.0 from distutils.util import get_platform
847.0 ModuleNotFoundError: No module named 'distutils'
------
```
PR: https://github.com/pytorch/pytorch/pull/134455
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134595
Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/malfet
**Summary**
This PR is a follow-up of #126924 to address reviewer's comments:
1) add a test case to show the use of `local_map` as a function decorator.
2) simplify the logic of handling different data types of `out_placements`.
3) correct variable naming in test cases to match math formulas.
**Test**
see #126924
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127752
Approved by: https://github.com/wanchaol
This fixes BatchNorm behavior when called with empty tensors on the MPS backend. Removed `expectedFailureMPS` in test_nn.py, deleted the expected failure in `test_mps.py`, and adjusted `skipIfMPS` to `expectedFailureMPS` in the BatchNorm2d OpInfo decorator, but restricted it only to the memory format tests.
Test Plan: CI + `python3 -c "import torch; print(torch.nn.BatchNorm2d(3, device='mps')(torch.rand(0, 3, 2, 2, device='mps')))"`
Fixes https://github.com/pytorch/pytorch/issues/134423
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134540
Approved by: https://github.com/Skylion007, https://github.com/albanD
## Context
In some user Triton kernels, we have this set-up for whatever reason.
```
@triton.jit
def mykernel(
    param0,
    param1,
    param2,
    param3: tl.constexpr,  # autotuned
    param4,  # non-constexpr
):
    ...
```
This is an edge case because it's general practice to declare all constexpr params at the end.
This is an issue for AOTI because it fails to codegen all 4 params, which surfaces as a device-side error: CUDA IMA, invalid argument...
```
> void* kernel_args_var_0[] = {&var_0, &var_1, &var_2};
---
< CUdeviceptr var_3;
< AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_data_ptr(buf0, reinterpret_cast<void**>(&var_3)));
< void* kernel_args_var_0[] = {&var_0, &var_1, &var_2, &var_3};
```
## Root-cause
* `kernel.constexpr` from the Kernel side-table contains the indices for all `constexpr` params that includes autotuned params.
* `raw_args`, that gets passed to wrapper codegen, excludes autotuned args.
* In the wrapper codegen, we try to find non-constexpr args using `kernel.constexpr` & `raw_args`. This is okay unless there's a `raw_arg` after an autotuned param in the function signature.
79b7fff188/torch/_inductor/codegen/cpp_wrapper_cuda.py (L118-L126)
## Fix
We fix this by calculating the right constexpr indices wrt `raw_args`.
An illustration
```
raw_args: [arg0, arg1, arg2, arg4]
kernel.arg_names: [param0, param1, param2, param3, param4]
kernel.constexprs: [3] # param3 is autotuned; this is correct wrt kernel.arg_names
constexpr_indices: [] # this is correct wrt raw_args
```
Differential Revision: [D61831625](https://our.internmc.facebook.com/intern/diff/D61831625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134520
Approved by: https://github.com/oulgen
This is designed to be a more ergonomic interface on top of justknob_feature (see https://github.com/pytorch/pytorch/pull/134151 for just the PR with the base commits).
The idea is that people stop having to think about this as much, and can just do JustKnobsConfig("//the:thing", "FORCE_THING") and it'll do the right thing.
Primarily sending this to see how people feel about the API, and using it for new config changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134161
Approved by: https://github.com/ezyang
So that the tensor's lifetime management is the same as the management built for the NCCL, pre and post kernels.
Also so that on visualizers they show up in the NCCL stream line; otherwise, if they show up in the compute line, users may get confused ("my code does not have these kernels").
The check is thus moved after the point where we make the NCCL stream depend on the last compute kernel.
Also moved declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134300
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
Clarify that `add_safe_globals` will also allow types encountered via these pickle instructions.
Some types do not appear as `GLOBAL` and are only caught in `BUILD`; an example from the HF Slack is `numpy.dtypes.UInt32DType`:
```python
import torch
import numpy as np
from tempfile import TemporaryDirectory
from pathlib import Path
from codecs import encode
torch.serialization.add_safe_globals([encode, np.dtype, np.core.multiarray._reconstruct, np.ndarray])
with TemporaryDirectory() as tempdir:
    p = Path(tempdir)
    r2 = np.random.get_state()
    torch.save(r2, p / "r2.pkl")
    torch.load(p / "r2.pkl", weights_only=True)
```
Yields (error comes from BUILD)
```
UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Can only build Tensor, parameter or OrderedDict objects, but got <class 'numpy.dtypes.UInt32DType'>
```
The reasoning is that `numpy.dtypes.UInt32DType` is constructed via `REDUCE` with `func = <class 'numpy.dtype'>` and `args = ('u4', False, True)`, so clarify in the error message that calling `add_safe_globals` on these types will also allow them.
After this PR error message becomes
```
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Can only build Tensor, Parameter, OrderedDict or types allowlisted via `add_safe_globals`, but got <class 'numpy.dtypes.UInt32DType'>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134346
Approved by: https://github.com/albanD
Changes jobs to go back to using the default AMI.
Note: This is only a cleanup PR. It does NOT introduce any behavior changes in CI
Now that the default variant uses the Amazon 2023 AMI and has been shown to be stable for a week, it's time to remove the explicit amz2023 references and go back to using the default variant.
After a week or two, when this is rolled out to most people, we can remove the variants from scale config as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134355
Approved by: https://github.com/jeanschmidt
Summary:
Currently the warning is printed when the cat inputs have same qparam, leading to a flood of warnings.
This diff emits the warning only when cat inputs don't have the same qparam.
Test Plan: CI
Reviewed By: aprotopopov
Differential Revision: D60638609
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133999
Approved by: https://github.com/tarun292
Fixes #127519
Currently in torchrun rendezvous, there are only two rendezvous backends supported out of the box: `C10d` and `Etcd`. The changes in this PR enable distributed elastic users to bring their own out-of-tree rendezvous backend implementations as Python packages.
#### AUTHORING NEW PLUGIN
Any new plugin will be a python package exposing entry-points. For example, the structure of redis plugin is as follows:
```
plugin_root
|_ pyproject.toml
|_ src
|_ redis
|_ __init__.py
|_ redis_store.py
|_ redis_backend.py
```
The contents of the `pyproject.toml` should indicate that this package exposes a torchrun entry point by mentioning the group name `torchrun.plugins`. The `pyproject.toml` for the redis plugin would be as follows:
```
[project]
name = "redis"
version = "0.0.1"
[project.entry-points.'torchrun.plugins']
redis = 'redis'
```
The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows:
```
def getPluginHandler():
    def _create_redis_handler(params: RendezvousParameters):
        from redis_rendezvous_backend import create_backend
        backend, store = create_backend(params)
        return create_handler(store, backend, params)
    return _create_redis_handler
```
The files `redis_store` and `redis_backend` contain the implementation of [Store](41189b0da4/torch/_C/_distributed_c10d.pyi (L171)) and [RendezvousBackend](e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)) respectively.
#### USER EXPERIENCE
Before using the plugin for the first time, the user has to install the plugin packages. For example, published packages can be installed using `pip3 install <plugin-name>`, and a plugin in the local file system can be installed using `pip3 install -e <plugin-location>`.
Once installed, the new backend can be used in torchrun as follows:
```
torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633
Approved by: https://github.com/wconstab
Enable Windows inductor UTs for `test/inductor/test_torchinductor_codegen_dynamic_shapes.py`.
The failure depends on https://github.com/pytorch/pytorch/pull/134429; need to rebase after https://github.com/pytorch/pytorch/pull/134429 is merged.
```cmd
2024-08-25T23:57:23.2747794Z Windows CI does not have necessary dependencies for test_torchinductor_dynamic_shapes yet
2024-08-25T23:57:23.2748541Z Traceback (most recent call last):
2024-08-25T23:57:23.2749593Z File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_torchinductor_codegen_dynamic_shapes.py", line 30, in <module>
2024-08-25T23:57:23.2750688Z from inductor.test_torchinductor_dynamic_shapes import (
2024-08-25T23:57:23.2751877Z File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_torchinductor_dynamic_shapes.py", line 46, in <module>
2024-08-25T23:57:23.2752876Z raise unittest.SkipTest("requires sympy/functorch/filelock")
2024-08-25T23:57:23.2753545Z unittest.case.SkipTest: requires sympy/functorch/filelock
2024-08-25T23:57:23.2754077Z Got exit code 1
2024-08-25T23:57:23.2754874Z No stepcurrent file found. Either pytest didn't get to run (e.g. import error) or file got deleted (contact dev infra)
```
Local test pass:
<img width="1892" alt="image" src="https://github.com/user-attachments/assets/241ab082-6026-4f33-b3ac-7e9ef7da744d">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134428
Approved by: https://github.com/jansel
Summary:
We want to add compile IDs and frames to each Torch-Compiled Region in order to help users cross reference the section they are checking alongside data obtained from tools, such as tlparse.
This diff operates on the assumption that each graph section will enter and exit a CompileContext before it is run, to either compile the graph or look it up in the cache. Based on this assumption, we can save the value of the graph section from the exited CompileContext in eval_frame.c using a Python C API. After this, we can create a new interface in the cpp shim to wrap around record_function in order to pass in the new keyword argument for "context".
Test Plan:
Enhance test_profiler_dynamo_compiled_region to look for kwinputs as well as a name to see that the context is now labeled. Also changed test to run graph with more contexts so that we test a wider range of profiling.
Differential Revision: D60803317
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132765
Approved by: https://github.com/anijain2305
This PR increases test coverage by including the tests in `test/test_nn.py` in the test suite of MPS.
Some of the tests are decorated with `@expectedFailureMPS` for various reasons: either the op is not implemented, or the outputs do not align. Tests with differing results should be investigated further to rule out any live bugs.
```bash
$ python test/run_test.py --mps --verbose -k TestNN
Running test batch 'tests to run' cost 84.76 seconds
```
Ref #133520
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134184
Approved by: https://github.com/albanD, https://github.com/malfet
There are 2 concepts of unsupported sympy.Functions in symbolic_shapes:
1) unsupported by the export solver, meaning the solver doesn't know how to provide useful fixes for those functions
2) unsupported by the sympy interpreter - meaning we can't reify them into FX nodes because the functions aren't present in PythonReferenceAnalysis
This splits the current call into a call for each version, with the Export solver the only user of 1). For 1), we enumerate the functions in _sympy/functions.py, and subtract the functions we know we can support. For 2) there's only 3 functions we've seen pop up in test cases.
Differential Revision: D61677956
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134271
Approved by: https://github.com/avikchaudhuri
Enable Windows inductor UTs for `test/inductor/test_binary_folding.py`.
The failed UT depends on https://github.com/pytorch/pytorch/pull/134427.
Need to rebase after https://github.com/pytorch/pytorch/pull/134427 is merged.
```cmd
2024-08-25T23:32:23.0905727Z Traceback (most recent call last):
2024-08-25T23:32:23.0906516Z File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_binary_folding.py", line 18, in <module>
2024-08-25T23:32:23.0908200Z from inductor.test_inductor_freezing import TestCase
2024-08-25T23:32:23.0909883Z File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_inductor_freezing.py", line 39, in <module>
2024-08-25T23:32:23.0911128Z raise unittest.SkipTest("requires sympy/functorch/filelock")
2024-08-25T23:32:23.0911801Z unittest.case.SkipTest: requires sympy/functorch/filelock
2024-08-25T23:32:23.0912370Z Got exit code 1
2024-08-25T23:32:23.0913155Z No stepcurrent file found. Either pytest didn't get to run (e.g. import error) or file got deleted (contact dev infra)
```
Local test pass:
<img width="1898" alt="image" src="https://github.com/user-attachments/assets/4a6e3f66-4bbc-4aab-8f0d-2e2318046e53">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134425
Approved by: https://github.com/ezyang, https://github.com/jansel
Windows file paths use `\` as the delimiter, which is also an escape character. We need to translate all path `\` separators to `/`, as on Linux (see the sketch below).
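A small sketch of the separator translation described above (the helper name is illustrative):
```python
def normalize_path_separators(p: str) -> str:
    # Translate Windows '\' separators to '/' so the path can be embedded in
    # generated Python source without producing invalid escape sequences.
    return p.replace("\\", "/")

print(normalize_path_separators(r"C:\Users\Xuhan\AppData\Local\Temp\tmpufu9t3pc"))
# C:/Users/Xuhan/AppData/Local/Temp/tmpufu9t3pc
```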
Reproduce UTs:
```cmd
pytest test\dynamo\test_minifier.py -v -k test_after_dynamo_cpu_accuracy_error
```
Error message:
```cmd
____________________________________________________________________________________________________________ MinifierTests.test_after_dynamo_cpu_accuracy_error _____________________________________________________________________________________________________________
Traceback (most recent call last):
File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_minifier.py", line 40, in test_after_dynamo_cpu_accuracy_error
self._test_after_dynamo(
File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_minifier.py", line 27, in _test_after_dynamo
self._run_full_test(run_code, "dynamo", expected_error, isolate=False)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_dynamo\test_minifier_common.py", line 235, in _run_full_test
self.assertIn(expected_error, test_proc.stderr.decode("utf-8"))
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 1112, in assertIn
self.fail(self._formatMessage(msg, standardMsg))
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 675, in fail
raise self.failureException(msg)
AssertionError: 'AccuracyError' not found in 'Traceback (most recent call last):\n File "C:\\Users\\Xuhan\\.conda\\envs\\win_mkl_static\\lib\\site-packages\\torch\\_dynamo\\test_minifier_common.py", line 114, in _maybe_subprocess_run\n exec(code, {"__name__": "__main__", "__compile_source__": code})\n File "<string>", line 9\n torch._dynamo.config.debug_dir_root = "C:\\Users\\Xuhan\\AppData\\Local\\Temp\\tmpufu9t3pc"\n ^\nSyntaxError: (unicode error) \'unicodeescape\' codec can\'t decode bytes in position 2-3: truncated \\UXXXXXXXX escape\n'
To execute this test, run the following from the base repo dir:
python test\dynamo\test_minifier.py MinifierTests.test_after_dynamo_cpu_accuracy_error
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
--------------------------------------------------------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------------------------------------------------------
test stdout:
test stderr: Traceback (most recent call last):
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_dynamo\test_minifier_common.py", line 114, in _maybe_subprocess_run
exec(code, {"__name__": "__main__", "__compile_source__": code})
File "<string>", line 9
torch._dynamo.config.debug_dir_root = "C:\Users\Xuhan\AppData\Local\Temp\tmpufu9t3pc"
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
--------------------------------------------------------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------------------------------------------------------
running test
```
Local test passed:
<img width="849" alt="image" src="https://github.com/user-attachments/assets/4a4eecc2-7c08-4de6-9395-546b69803b16">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134365
Approved by: https://github.com/jansel, https://github.com/jgong5
Optimizes the memory cost of [PR #129635](https://github.com/pytorch/pytorch/pull/129635)
There are 2 main parts to the optimization here:
1. Optimize the tensor distribution step: postpone the full_tensor generation, which avoids the memory overlap and saves around 50% peak memory in the 2-param test case.
2. Apply `assign=True` for `load_state_dict`, which saves memory during state dict loading by assigning the input params; around 50% peak memory at the loading step (see the sketch below).
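A minimal sketch of the `assign=True` pattern from item 2, shown with a meta-device module so no parameter memory is allocated before loading (a toy model, not the code from this PR):
```python
import torch
import torch.nn as nn

with torch.device("meta"):
    model = nn.Linear(4, 4)  # parameters carry only metadata, no real storage

state = {"weight": torch.randn(4, 4), "bias": torch.randn(4)}
# assign=True makes the module adopt the loaded tensors directly instead of
# copying them into pre-allocated parameters, avoiding a second full copy.
model.load_state_dict(state, assign=True)
print(model.weight.device)  # cpu
```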
Future work:
Memory optimization for the opt will be conducted in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134025
Approved by: https://github.com/fegin
Co-authored-by: Rachel Guo <guorachel@meta.com>
Summary: Fixes https://github.com/pytorch/pytorch/issues/134133
Test Plan:
Tested on the small repro in the linked issue with different lengths N (replacing 100), recording N vs. time taken in nanoseconds:
10 127268319
20 220839662
30 325463125
40 429259441
50 553136055
60 670799769
70 999170514
80 899014103
90 997168902
100 1168202035
110 1388556619
120 1457488235
130 1609816470
140 2177889877
150 1917560313
160 2121096113
170 2428502334
180 4117450755
190 4003068224
So N ~ 200 takes ~5s. Previously even smaller N would go for >1 min.
Didn't add a perf test because ezyang is planning to build a benchmark.
Also tested on https://www.internalfb.com/diff/D61560171, which now gets past the stuck point.
Differential Revision: D61619660
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134150
Approved by: https://github.com/ezyang
Because aten.poisson doesn't have a meta function registered, there is one additional eager execution of this op during the compilation phase of torch.compile.
There are more ops without meta registrations. Is there any reason for that?
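A runnable sketch of why a meta/fake kernel avoids the extra eager run at compile time, using a hypothetical custom op (`demo::noisy`) rather than aten.poisson itself; the real fix registers a meta function for aten.poisson inside PyTorch.
```python
import torch

@torch.library.custom_op("demo::noisy", mutates_args=())
def noisy(x: torch.Tensor) -> torch.Tensor:
    return torch.poisson(x)

@noisy.register_fake
def _(x):
    # Only output metadata (shape/dtype/device) is produced here, so tracing
    # under torch.compile never has to execute the real op eagerly.
    return torch.empty_like(x)

out = torch.compile(lambda x: noisy(x))(torch.rand(4))
```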
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134103
Approved by: https://github.com/ezyang
I had a nightmare rewriting tests in test_misc.py specifically:
1. Graphs can have comments that refer to my files ("/lsakka/..."); we really don't care about comments, so add an option to ignore them.
2. Empty lines added when EXPECTTEST_ACCEPT=1 are changed by the linter, causing the tests or the linter to fail!
Add a flag to ignore empty lines.
3. EXPECTTEST_ACCEPT fails when the text has some non-readable characters. Those should not affect string comparison, and they also cause weird diffs when tests fail. I removed the ANSI escape chars in https://github.com/pytorch/pytorch/pull/133045
this is used in
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134248
Approved by: https://github.com/aorenste
ghstack dependencies: #133639, #134364
This UT's actual code differs only by one empty-line wrap (between `linear` and `add`) between Windows and Linux, and the content is otherwise correct.
Reproduce UTs:
```cmd
pytest test\dynamo\test_higher_order_ops.py -v -k test_functional_call_sequential_params_and_buffers
```
We can add `empty_line_normalizer` to fix it.
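One possible shape of such a normalizer, as a sketch (the helper added to the test utilities may differ in detail):
```python
import re

def empty_line_normalizer(text: str) -> str:
    # Collapse runs of blank lines so a stray empty line between "linear" and
    # "add" on one platform doesn't break the expected-output comparison.
    return re.sub(r"\n\s*\n", "\n", text)
```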
```cmd
______________________________________________________________________________________________ FuncTorchHigherOrderOpTests.test_functional_call_sequential_params_and_buffers _______________________________________________________________________________________________
Traceback (most recent call last):
File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py", line 3676, in test_functional_call_sequential_params_and_buffers
self.assertExpectedInline(
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 2871, in assertExpectedInline
return super().assertExpectedInline(actual if isinstance(actual, str) else str(actual), expect, skip + 1)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\expecttest\__init__.py", line 271, in assertExpectedInline
self.assertMultiLineEqualMaybeCppStack(expect, actual, msg=help_text)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\expecttest\__init__.py", line 292, in assertMultiLineEqualMaybeCppStack
self.assertMultiLineEqual(expect, actual, *args, **kwargs)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 1226, in assertMultiLineEqual
self.fail(self._formatMessage(msg, standardMsg))
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 675, in fail
raise self.failureException(msg)
AssertionError: 'clas[509 chars]one\n add: "f32[1, 1]" = linear + l_buf[69 chars],)\n' != 'clas[509 chars]one\n\n add: "f32[1, 1]" = linear + l_b[71 chars],)\n'
class GraphModule(torch.nn.Module):
def forward(self, L_params_l1_weight_: "f32[1, 1]", L_params_l1_bias_: "f32[1]", L_buffers_buffer_: "f32[1]", L_inputs_: "f32[1, 1]"):
l_params_l1_weight_ = L_params_l1_weight_
l_params_l1_bias_ = L_params_l1_bias_
l_buffers_buffer_ = L_buffers_buffer_
l_inputs_ = L_inputs_
linear: "f32[1, 1]" = torch._C._nn.linear(l_inputs_, l_params_l1_weight_, l_params_l1_bias_); l_inputs_ = l_params_l1_weight_ = l_params_l1_bias_ = None
+ <<<< (difference is here )
add: "f32[1, 1]" = linear + l_buffers_buffer_; linear = l_buffers_buffer_ = None
return (add,)
: To accept the new output, re-run test with envvar EXPECTTEST_ACCEPT=1 (we recommend staging/committing your changes before doing this)
To execute this test, run the following from the base repo dir:
python test\dynamo\test_higher_order_ops.py FuncTorchHigherOrderOpTests.test_functional_call_sequential_params_and_buffers
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
========================================================================================================================== short test summary info ==========================================================================================================================
FAILED [0.4275s] test/dynamo/test_higher_order_ops.py::FuncTorchHigherOrderOpTests::test_functional_call_sequential_params_and_buffers - AssertionError: 'clas[509 chars]one\n add: "f32[1, 1]" = linear + l_buf[69 chars],)\n' != 'clas[509 chars]one\n\n add: "f32[1, 1]" = linear + l_b[71 chars],)\n'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134394
Approved by: https://github.com/jansel
Co-authored-by: Jason Ansel <jansel@jansel.net>
After this I think all `using namespace` will have been eliminated from PyTorch header files. Internally, `-Wheader-hygiene` will prevent more from being added.
Test Plan: Sandcastle
Differential Revision: D61679037
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134336
Approved by: https://github.com/Skylion007
Summary:
This enables patching extern modules to provide compatibility with serialized code depending on different versions of those extern modules.
The main motivation is to enable the Numpy upgrade. In the recent release many aliases of builtin types were deprecated and removed [1]. This breaks loading pickled modules that reference the removed aliases. While the proper solution is to re-generate the pickled modules, it's not always feasible.
This proposes a way to define a mapping with a new type for a module member. It is only set if it's not present in the loaded module, which removes the need to check for exact versions.
https://numpy.org/doc/stable/release/1.20.0-notes.html#using-the-aliases-of-builtin-types-like-np-int-is-deprecated
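The kind of breakage and the "set only if missing" idea described above, shown as a standalone sketch (the actual mechanism hooks into torch.package's extern-module handling rather than patching numpy globally):
```python
import numpy as np

# Old pickled modules may reference np.int, which newer NumPy releases removed.
# Installing the alias only when it is absent avoids checking for exact versions.
if not hasattr(np, "int"):
    np.int = int
```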
Differential Revision: D61556888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134376
Approved by: https://github.com/SherlockNoMad
If a mesh_dim_name is given, we will use the given mesh_dim_name to name the new flattened dim.
Otherwise, the default is a string concatenating the mesh_dim_names of the given submesh, with each mesh_dim_name separated by "_".
For example, if we have a 3D mesh DeviceMesh([[[0, 1], [2, 3]], [[4, 5], [6, 7]]], mesh_dim_names=("dp", "cp", "tp")), calling mesh_3d["dp", "cp"]._flatten() will create a 1D submesh DeviceMesh([0, 1, 2, 3], mesh_dim_names=("dp_cp",)) on rank 0, 1, 2, 3 and a 1D submesh DeviceMesh([4, 5, 6, 7], mesh_dim_names=("dp_cp",)) on rank 4, 5, 6, 7.
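The naming rules above as a usage sketch; it assumes a distributed job with 8 ranks and an initialized backend, so it is illustrative rather than standalone-runnable.
```python
from torch.distributed.device_mesh import init_device_mesh

mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"))
# Default: the flattened dim is named "dp_cp" (the joined submesh dim names).
dp_cp_mesh = mesh_3d["dp", "cp"]._flatten()
# Alternatively, pass mesh_dim_name="my_dp_cp" to pick the name explicitly.
```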
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134048
Approved by: https://github.com/fegin
ghstack dependencies: #133838, #133839
Sympy's implementation of Min/Max displays asymptotically bad behavior on `TORCH_COMPILE_CPROFILE=1 python torchrec/distributed/tests/test_pt2_multiprocess.py TestPt2Train.test_compile_multiprocess`. Evidence profile:

On this test case, we spend 42% of all time compiling the network on ShapeEnv.replace, which in turn spends all of its time in xreplace.
The problem appears to be find_localzeros call. By vendoring the implementations of Min/Max, we can potentially reduce the cost of this operation.
The implementation is copy-pasted sympy/functions/elementary/miscellaneous.py but with some adjustments:
* I deleted logic related to differentiation, evalf and heaviside, as it's not relevant to PyTorch reasoning
* There's some massaging to appease PyTorch's linters, including a lot of noqa and type: ignore (which I could potentially refactor away with substantive changes, but that's better as its own change)
* I deleted the second loop iteration for is_connected, as an attempt at initial optimization (this also simplifies the port, since I can omit some code). I'll comment at that point what the exact difference is.
Before this change, the test in question takes 100s with 40 features; after this change, it takes only 69s.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133319
Approved by: https://github.com/Skylion007
Summary:
Today there is no good mechanism to detect progress of non-strict export line-by-line in user code. This caused some pain recently in trying to find the exact line of user code that was triggering a bug where the process appeared stuck because deep down something was calling some symbolic shapes code that was suffering some exponential blowup.
This PR adds an environment variable for extended debugging that will log the line of user code corresponding to every torch function call. It only works in non-strict export for now. Combine setting this environment variable with `TORCH_LOGS` enabling `export` logs at `DEBUG` level (i.e., with a `+` prefix), e.g.:
```
TORCHEXPORT_EXTENDED_DEBUG_CURRENT_LOC=1 TORCH_LOGS="+export" ...
```
This will show logs with something like:
```
...
prim::device called at .../example.py:4284 in foo
TensorBase.item called at .../example.py:4277 in bar
...
```
We already have an existing place to intercept torch functions where we process data-dependent errors in non-strict, so parking the logging there. An alternative place we could be doing this is where we add `stack_trace` metadata when generating code, but unfortunately at least the example that motivated this gets stuck before generating code, so that would be too late.
Test Plan: ran it on some sample commands
Differential Revision: D61692156
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134298
Approved by: https://github.com/angelayi
Summary: Create simple test that checks that FunctionEvent build tree happens lazily by checking that the metrics for it changes before and after call.
Test Plan: Make sure test passes in CI
Reviewed By: briancoutinho
Differential Revision: D61685429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134359
Approved by: https://github.com/briancoutinho
Fixes #133338
Test Plan:
```
TORCH_LOGS=dynamic python
import torch
torch._dynamo.config.capture_scalar_outputs = True
@torch.compile()
def f(x):
    y = x.item()
    torch._check_is_size(y)
    r = torch.arange(y, dtype=torch.float32)
    torch._check(r.size(0) == y)
    return r
f(torch.tensor([300]))
```
Run before and after this diff, and verify that the following line
```
I0813 11:05:44.890000 652898 torch/fx/experimental/symbolic_shapes.py:5198] [0/0] runtime_assert Eq(CeilToInt(IntTrueDiv(u0, 1)), u0) [guard added] at aa.py:10 in f (_dynamo/utils.py:2092 in run_node), for more info run with TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(CeilToInt(IntTrueDiv(u0, 1)), u0)"
```
no longer shows in the logs. Also verify CI passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134296
Approved by: https://github.com/aorenste
The current temporary directory path is hard-coded. Fix it by getting the temporary directory path via the API.
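The fix in spirit, as a short sketch: derive the scratch location from the platform's temporary-directory API instead of hard-coding `/tmp` (the file name below is illustrative).
```python
import os
import tempfile

# On Windows this resolves under %TEMP% instead of the non-existent /tmp.
package_path = os.path.join(tempfile.gettempdir(), "MyTorchPackage.pt")
```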
Reproduce UTs:
```cmd
python test/dynamo/test_dynamic_shapes.py -v -k test_torch_package_working_with_trace_dynamic_shapes
```
Error message:
```cmd
________________________________________________________________________________________________ DynamicShapesMiscTests.test_torch_package_working_with_trace_dynamic_shapes ________________________________________________________________________________________________
Traceback (most recent call last):
File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_misc.py", line 7199, in test_torch_package_working_with_trace
with package.PackageExporter(path) as exp:
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\package\package_exporter.py", line 237, in __init__
self.zip_file = torch._C.PyTorchFileWriter(f)
RuntimeError: Parent directory /tmp does not exist.
To execute this test, run the following from the base repo dir:
python test\dynamo\test_dynamic_shapes.py DynamicShapesMiscTests.test_torch_package_working_with_trace_dynamic_shapes
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
========================================================================================================================== short test summary info ==========================================================================================================================
FAILED [0.0080s] test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_package_working_with_trace_dynamic_shapes - RuntimeError: Parent directory /tmp does not exist.
==================================================================================================================== 1 failed, 1665 deselected in 4.00s =====================================================================================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134397
Approved by: https://github.com/ezyang
Fixes #130394
TorchInductor doesn't respect the original strides of outputs. That opens up optimization opportunities like changing the memory layout. But in some cases, such as the one in https://github.com/pytorch/pytorch/issues/130394, we do need the output to match the exact required stride. Correctness is the first-priority goal. So, this PR adds a new API `ir.ExternKernel.require_exact_strides(x, exact_strides, allow_padding=False)` to fix the issue. This PR makes non-dense outputs' strides follow the strides required by the semantics.
The comparison between the original codegen and the codegen after this fix for the test is below.
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 128
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 8
    x1 = (xindex // 8)
-   x2 = xindex
    tmp0 = tl.load(in_ptr0 + (x0 + (16*x1)), xmask)
    tmp1 = tmp0 + tmp0
-   tl.store(out_ptr0 + (x2), tmp1, xmask)
+   tl.store(out_ptr0 + (x0 + (16*x1)), tmp1, xmask)

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (16, 8), (16, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
-       buf1 = empty_strided_cuda((16, 8), (8, 1), torch.float32)
+       buf1 = empty_strided_cuda((16, 8), (16, 1), torch.float32)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_copy_0.run(arg0_1, buf1, 128, grid=grid(128), stream=stream0)
        del arg0_1
    return (buf1, )
```
`buf1` is now created with the exact stride required by the user, and its values are written with the same stride as the input.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130956
Approved by: https://github.com/eellison, https://github.com/blaine-rister
torch.cuda.amp.autocast / torch.cpu.amp.autocast are deprecated and spew a ton of warnings when these tests run. This PR updates the tests to just use torch.amp.autocast(device).
Note: this uncovers a bug in the test: when `device` is CUDA, it actually shows up as "cuda:0" - so previously, this test was _always_ using `torch.cpu.amp.autocast` even for the `cuda` device. This PR fixes this, and uncovers additional bugs in `pinverse` and `linalg.pinv`: `linalg.pinv` was already failing before on CPU, but now the test also catches failures on CUDA (and this PR adds those to the skipped-test list).
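The replacement pattern this PR applies, sketched with a toy matmul; note the pitfall called out above, where the test's device string is "cuda:0" while `torch.amp.autocast` expects the device *type*.
```python
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
device_type = torch.device(device).type  # "cuda", not "cuda:0"

with torch.amp.autocast(device_type):
    a = torch.randn(4, 4, device=device)
    b = torch.randn(4, 4, device=device)
    out = a @ b
```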
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134291
Approved by: https://github.com/YuqingJ
Summary:
# context
* when fixing the graph break in _maybe_compute_kjt_to_jt_dict, we encountered this issue P1539489731:
```
[rank0]: ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False.
[rank0]: Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance.
[rank0]:
[rank0]: Potential framework code culprit (scroll up for full backtrace):
[rank0]: File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/61f992c26f3f2773/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_inductor/fx_passes/post_grad.py", line 671, in slice_noop
[rank0]: if start == 0 and end >= 2**63 - 1 and step == 1:
```
* change the condition logic to be compatible with SymInt
Test Plan:
# commands
* run test
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2 2>&1 | tee -a `date +"%Y.%m.%d.%H.%M"`.`sl whereami`.log
```
* tlparse
```
ls -thl /var/tmp/tt | head -9 && tlparse `ls -t /var/tmp/tt/* | head -1`
```
Differential Revision: D61677207
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134270
Approved by: https://github.com/ezyang
Summary:
This diff will decompose torch.ops._quantized.wrapped_quantized_linear into torch.ops._quantized.wrapped_linear_prepack and torch.ops._quantized.wrapped_quantized_linear_prepacked for AOTI, and added the corresponding impl into shim
The way it works will be similar to what we did previously for fbgemm fp16 dynamic qlinear. We will do constant folding for packed weight during runtime (warm up) to achieve the speed up
Reviewed By: desertfire
Differential Revision: D61396144
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134368
Approved by: https://github.com/houseroad
Windows file paths use `\` as the delimiter, which is also an escape character. We need to translate every `\` in paths to `/`, as on Linux.
Reproduce UT:
```cmd
pytest test\dynamo\test_higher_order_ops.py -v -k test_vmap_grad_vmap_guard_fail
```
Error msg:
```cmd
________________________________________________________________________________________________________ HigherOrderOpVmapGuardTests.test_vmap_grad_vmap_guard_fail _________________________________________________________________________________________________________
Traceback (most recent call last):
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\logging_utils.py", line 89, in test_fn
fn(self, records)
File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py", line 2714, in test_vmap_grad_vmap_guard_fail
munge_exc(record.getMessage()),
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 5252, in munge_exc
s = re.sub(file, os.path.basename(file), s)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\re.py", line 209, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\re.py", line 303, in _compile
p = sre_compile.compile(pattern, flags)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_compile.py", line 788, in compile
p = sre_parse.parse(p, flags)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 955, in parse
p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 444, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 526, in _parse
code = _escape(source, this, state)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 370, in _escape
raise source.error("incomplete escape %s" % escape, len(escape))
re.error: incomplete escape \x at position 2
To execute this test, run the following from the base repo dir:
python test\dynamo\test_higher_order_ops.py HigherOrderOpVmapGuardTests.test_vmap_grad_vmap_guard_fail
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
--------------------------------------------------------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------------------------------------------------------
frames [('total', 2), ('ok', 2)]
inductor []
inline_call []
stats [('calls_captured', 38), ('unique_graphs', 2)]
--------------------------------------------------------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------------------------------------------------------
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles] Recompiling function fn in D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py:2699
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles] triggered by the following guard failure(s):
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles] - 0/0: torch._functorch.pyfunctorch.compare_functorch_state([('Vmap', 1, 'error')]) # _dynamo\output_graph.py:479 in init_ambient_guards
========================================================================================================================== short test summary info ==========================================================================================================================
FAILED [0.7452s] test/dynamo/test_higher_order_ops.py::HigherOrderOpVmapGuardTests::test_vmap_grad_vmap_guard_fail - re.error: incomplete escape \x at position 2
```
Local test passed:
<img width="860" alt="image" src="https://github.com/user-attachments/assets/90f0d780-0639-4c03-8d7c-6f227c93a3fc">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134348
Approved by: https://github.com/jansel
Fixes #133499
### The issue
Testing a variety of TP `requires_grad` patterns (validating maximally flexible finetuning) revealed `DTensor` sharding propagation of `aten.native_layer_norm_backward` (default) fails with an `IndexError` for certain `requires_grad` patterns (pattern 1) (e.g. `output_mask` `[True, False, False]`) and an `AssertionError` for others (pattern 2) (e.g. output mask `[False, True, *]`). Please see issue #133499 for a full description of the observed failure patterns along with reproduction.
### Use Cases and Remediation
Failure pattern 1 is potentially problematic for a variety of finetuning scenarios. Though failure pattern 2 is really an xfail right now since it's not fully supported, IMHO there are use cases (e.g. especially wrt to mechanistic interpretability research, but certain finetuning scenarios too potentially) that justify supporting this output mask (especially since supporting it is fairly straightforward I think).
In this PR I propose some modest changes that:
* Address the aforementioned failure modes.
* Add a couple tests that I'm hopeful will help ensure `DTensor` op dispatch (which is so well implemented and such a pleasure working with btw! 🚀🎉) accommodates a wide variety of (potentially unanticipated) `requires_grad` patterns as it evolves.
To address both failure modes, I'm proposing the following changes:
1. To [`torch.distributed._tensor.ops._math_ops.layer_norm_bwd_strategy`](7b269cc484/torch/distributed/_tensor/ops/_math_ops.py (L873)):
- Refactor conditional `output_mask` handling such that the input and output specs in the`PlacementStrategy`s of the returned `output_strategy.strategies` list remain aligned with the `op_schema.args_spec` (whose definition does not change at runtime based upon unused optional args).
2. To [`torch.distributed._tensor._sharding_prop.propagate_op_sharding_non_cached`](7b269cc484/torch/distributed/_tensor/_sharding_prop.py (L256-L262)):
- When iterating through the active `op_schema.args_spec` to build the relevant `expected_input_specs` list, filter any `None` `desired_specs`.
3. To [`torch/distributed/_tensor/_op_schema.OpSchema._inplace_rewrap_schema_suggestion`](7b269cc484/torch/distributed/_tensor/_op_schema.py (L418))
- When inputs need a redistribute, for runtime-unrequired (`None` arguments in the aligned `suggestion_args_schema`), ignore the associated `suggestion_args_spec`
### Implementation considerations:
- Regarding `1`, to avoid changing the op strategy return args ([`op_strategy`](cf81180007/torch/distributed/_tensor/_sharding_prop.py (L234))), the change in `1` allows `None` elements to exist temporarily in `PlacementStrategy.input_specs` (treating it as `Sequence[DTensorSpec | None] | None` when it's declared as `Sequence[DTensorSpec] | None`). This could be addressed in any number of ways but I thought it best to leave that for a subsequent PR since it could have broader ramifications (e.g. allowing op_strategies to return an `output_strategy.input_specs` mask explicitly, explicitly allowing `None`s in `PlacementStrategy.input_specs`, creating a `Null` DTensorSpec etc.). That's why I'm using an ignore arg-type directive there for now.
- Regarding `2` and `3` above, I don't introspect `op_schema.op._schema.arguments` to verify any `None` arguments are `torch.OptionalType`, leaving adherence to the schema contract the responsibility of the given op. Regarding `2`, I assume any `desired_spec` will be either a `DTensorSpec` or `None`, so only `None` can be Falsy in this context.
- I considered altering the active `args_schema`, which could be inspected and aligned with the active `output_strategy.input_specs` in some cases and avoid the changes in `3`, but I think that would rely on one of (among other possibilities):
- all supported op signatures having optional Tensor (`DTensorSpec`) args after required tensors (which isn't a planned requirement as far as I know),
- (somewhat brittle) heuristic-driven arg alignment
- only supporting kwargs etc.
### Added Tests
To facilitate detection of future `requires_grad` pattern op failure modes as `DTensor` evolves, I added the following two tests:
1. `test/distributed/_tensor/test_math_ops.py DistMathOpsTest.test_layer_norm_bwd_req_grad`
- Tests `native_layer_norm_backward` specifically with 20 subtests that sweep valid `output_mask` patterns along in different LayerNorm dimensionality and `elementwise_affine` configurations.
2. `test/distributed/tensor/parallel/test_tp_examples.py DistTensorParallelExampleTest.test_transformer_req_grad`
- Samples a subset of `requires_grad` patterns in a more realistic (relative to the `LayerNorm`-specific test) Transformer usage context with different `dtype` and `is_seq_parallel` configurations. Note since there was substantial overlap with the existing `test_transformer_training` test, I took the opportunity to refactor that test to allow relevant code-sharing. I also added an `ExpCommCounts` `NamedTuple` to facilitate the addition of additional `requires_grad` patterns that we may want to test in the future which may result in different comm counts. I created the separate `requires_grad` test to allow decoupling the multi-iteration `test_transformer_training` test and allow addition of new `requires_grad` scenarios as desired while being mindful of resources.
Thanks again to the PyTorch distributed team for your immensely valuable contributions to the open-source ML community!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133502
Approved by: https://github.com/XilunWu
For `aten.any`, we can use `reduce_op="sum"` as the linear reduction op.
When we do `all_reduce` with `reduce_op="sum"` on a bool tensor, if any rank returns `torch.Tensor([True])`, then the reduction result is `torch.Tensor([True])`. Only when all ranks return `torch.Tensor([False])` is the reduction result `torch.Tensor([False])`. This matches `any`'s behavior.
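A local sketch of why "sum" works as the linear reduction for `any` (plain tensors standing in for the per-rank all_reduce inputs):
```python
import torch

rank_results = [torch.tensor([False]), torch.tensor([True]), torch.tensor([False])]
reduced = torch.stack(rank_results).sum(dim=0).bool()
print(reduced)  # tensor([True]): any rank True -> True; only all-False -> False
```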
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134206
Approved by: https://github.com/tianyu-l, https://github.com/chuanhaozhuge
Add DeviceMesh slicing support such that we could do the following:
```
mesh_3d = init_device_mesh(
    self.device_type, (2, 2, 2), mesh_dim_names=("replicate", "shard", "cp")
)
shard_cp_mesh = mesh_3d["shard", "cp"]._flatten()
hsdp_mesh = mesh_3d["replicate", "shard_cp"]
# we can get the corresponding group of the flatten mesh through
group = shard_cp_mesh.get_group()
# or
group = mesh_3d["shard_cp"].get_group()
# or
mesh_3d.get_group(mesh_dim="shard_cp")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133839
Approved by: https://github.com/fegin
ghstack dependencies: #133838
### Description
This PR extends the `VecISA` class to include support for VSX on the `ppc64le` architecture within the Inductor backend. This enhancement enables vectorization support, resulting in performance improvements when using `torch.compile()` on `ppc64le`.
### Fixes
- Resolved the `test_acosh_with_negative_large_input` test case in `test_cpu_repro.py` by implementing `acosh` for VSX.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132746
Approved by: https://github.com/jansel
Summary: Pass process group info into NcclWork
Test Plan: buck2 run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test
Differential Revision: D61677160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134269
Approved by: https://github.com/wconstab
The pattern matcher runs DCE and remove_noop_ops on the replacement
graph by default. Previously we had a switch for the DCE. This PR
changes that switch to also control if we run remove_noop_ops.
The context was that there is silent incorrectness with
auto_functionalized. We use the Pattern matcher to decompose
auto_functionalized into a mutable op + clones; remove_noop_ops was
deleting the clones.
Future: can try #134363
Test Plan:
- new test. I wasn't able to produce a silently incorrect example so I
settled for asserting that clones still exist in the post-grad graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134364
Approved by: https://github.com/eellison
ghstack dependencies: #133639
This adds logs if we can't acquire locks in NCCLUtils and ProcessGroupNCCL for 30s.
This is motivated by some deadlocks we're seeing, and it's unclear whether they're in NCCL or on the PyTorch side of things.
This required replacing most `std::mutex` with `std::timed_mutex` and `std::condition_variable_any` as appropriate.
Test plan:
existing CI for regressions
will add unit tests on `C10D_LOCK_GUARD`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134131
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
Refactors construction of ExportGraphSignature object for export & training IR, explicitly creating AOTAutograd signature for training IR. This will be helpful for upcoming refactors for placeholder naming & runtime asserts prettifying.
Changes:
- dedups `make_argument_spec` call, moved to export/graph_signature.py
- `_sig_to_specs` wrapped into new function `_convert_to_export_graph_signature`, directly converts GraphSignature -> ExportGraphSignature
- `_make_fx_helper` explicitly creates AOTAutograd GraphSignature object
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134059
Approved by: https://github.com/angelayi, https://github.com/ydwu4
**Summary**
When checking the vectorization status across the 3 test suites, we found some operators disabled vectorization with the message `Disabled vectorization: op: remainder`. In this PR, we add vectorization support for this op.
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec
```
Differential Revision: [D61147014](https://our.internmc.facebook.com/intern/diff/D61147014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849
Approved by: https://github.com/jgong5, https://github.com/lezcano
Starter version of automatic dynamic shapes for export.
Creates enums `DIM.AUTO`, `DIM.STATIC`, allowing user to specify `AUTO` for dims in dynamic_shapes specs, meaning that corresponding dims are treated as dynamic, and relevant guards will do what's necessary (e.g. refine ValueRanges, set replacements based on equality, or even set static) without raising ConstraintViolationErrors. Basically allows the user to say, "a bunch of these dims can be dynamic, let export do model analysis and return the program with maximum possible dynamism, without complaining".
The usage for specifying `dynamic_shapes` is now:
```
AUTO -> dynamic by default, return whatever produce_guards() says, even if it's static
None/int/STATIC -> static
Dim/DerivedDim -> same as before - will complain if the min/max range is invalid, or if dims related to this are unspecified.
```
Caveat 1: specifying `AUTO` for a dim won't guarantee it'll be dynamic:
- specifying `AUTO` for a dim will return the maximum possible dynamism given your program and other specified constraints, but this can still mean you'll get a static program. For example, with the program below, x is specified dynamic, but it's equal to y, which is specified static, and with how we currently do things we won't promote y to dynamic, but will demote(?) x to static. So this can be surprising if you don't fully know your model, and/or missed one of your other inputs when specifying auto-dynamic shapes.
```
class Foo(torch.nn.Module):
def forward(self, x, y):
return x + y
inputs = (torch.randn(6), torch.randn(6))
export(Foo(), inputs, dynamic_shapes={"x": (DIM.AUTO,), "y": None})
```
Caveat 2: specifying `AUTO` and Dims in the same spec is still problematic:
- The way Dims/DerivedDims are currently handled is very strict. A Dim represents a symbol, and we require a user to specify the symbol for all dims governed by the symbol - that's why we've seen errors in the past like `The values of x must always be related to y by ...`, asking the user to specify the exact relation as in the program. We also require the specified min/max range to be a subset of the valid range from model analysis. All this doesn't compose well with specifying `AUTO` just yet - for example in the program below, ideal behavior could be to return a dynamic program, where `dx = x.size(0) = y.size(0)` has range (3,6). Unfortunately this crashes, and correct behavior is to specify `dx` for both inputs. So currently we raise a UserError and crash if both Dims + `AUTO` are present in the spec.
```
class Foo(torch.nn.Module):
def forward(self, x, y):
return x + y
inputs = (torch.randn(6), torch.randn(6))
export(Foo(), inputs, dynamic_shapes={"x": (DIM.AUTO,), "y": {0: Dim("dx", min=3, max=6)}}) # this doesn't work, because x & y are related
```
Implementation details:
This is done by setting `assume_static_by_default=False`, and doing a transform on the `dynamic_shapes` spec to preserve semantics. `assume_static_by_default=False` will treat unspecified dims or Nones as dynamic. This is the opposite of what `export.export()` currently does - unspecified Dims/Nones are treated as static. Historically this static-by-default behavior, where the user deals with fewer guards, has been desirable, and we would like to respect that in this implementation. So an internal spec transformation, `_transform_shapes_for_default_dynamic()`, is added; it does the spec conversion necessary to be compatible with dynamic by default. Specifically, AUTOs are converted into Nones, and Nones/unspecified dims are filled in with explicitly static constraints.
For example, this would look like, for a 3-d tensor: `{0: DIM.AUTO, 1: None, 2: Dim("dx")} -> {0: None, 1: 32, 2: Dim("dx")}`
This does seem overly complicated, but it's done to preserve dynamic shapes semantics for `torch._dynamo.export()`, which already uses `assume_static_by_default=False`, and follows the same process for generating shape constraints, via `_process_dynamic_shapes`. There the semantics are:
```
None/unspecified: dynamic by default
Dim/DerivedDim: also a strict assertion
```
If we don't care about BC for `_dynamo.export(dynamic_shapes)`, then we can just modify semantics for `_process_dynamic_shapes()` and change all the relevant tests in `test/dynamo/test_export.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133620
Approved by: https://github.com/avikchaudhuri
The function expects a tensor of type LongTensor. It currently throws the following error: "one_hot is only applicable to index tensor.", which, imo, does not provide the user with enough information about what the problem is.
This PR simply adds extra information to the error message for this specific scenario.
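The scenario the improved message targets, as a quick repro:
```python
import torch
import torch.nn.functional as F

try:
    F.one_hot(torch.tensor([0.0, 1.0, 2.0]), num_classes=3)  # float input -> error
except RuntimeError as e:
    print(e)  # message now spells out that a LongTensor of indices is expected

F.one_hot(torch.tensor([0, 1, 2]), num_classes=3)  # correct usage
```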
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134209
Approved by: https://github.com/mikaylagawarecki
`nn_module_stack` was previously serialized to a string by adding commas between the module_path and module_type. This is error-prone when the `nn_module_stack` itself contains commas.
This PR fixes this by creating a dictionary to store the `nn_module_stack` and serializing it to a string via `json.dumps()`.
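A sketch of the serialization change, with made-up entries (the real serde code lives in export's serialization layer):
```python
import json

nn_module_stack = {
    "L__self___block_0": ("L['self'].block[0]", "my_pkg.Block[int, str]"),  # commas inside!
}

# Old: ",".join(...) breaks when paths or types themselves contain commas.
# New: a dict serialized with json.dumps round-trips unambiguously.
serialized = json.dumps(nn_module_stack)
restored = json.loads(serialized)
assert restored["L__self___block_0"] == list(nn_module_stack["L__self___block_0"])
```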
Fixes #131941
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134049
Approved by: https://github.com/angelayi
Summary: Currently, for sequential mode, the minimizer search terminates after a node is excluded via the user-defined exclusion_fn. However, on some occasions we would like the search to continue past that node for the remaining nodes. In this diff I change the termination criteria to respect the find_all setting: we continue the sequential search if it is set.
Test Plan: CI
Differential Revision: D61720262
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134339
Approved by: https://github.com/jfix71
Fixes #134050
### The issue
The current `DTensor` sharding propagation caching policy for `aten.scaled_dot_product_efficient_attention` (default) can result in silently incorrect gradients or trigger an IMA after cuda kernel launch in mixed `require_grad` configurations. Please see issue #134050 for a full description of the observed failure patterns along with reproduction. Note `aten.scaled_dot_product_flash_attention` presents a similar concern so this PR addresses both [as discussed here.](https://github.com/pytorch/pytorch/issues/134050#issuecomment-2299887602)
### Remediation
While there are a number of ways this could be addressed, the most straightforward remediation is to modify the sharding propagation caching policy of [`aten._scaled_dot_product_efficient_attention.default`](b03381cac2/torch/distributed/_tensor/ops/_matrix_ops.py (L337-L340)), registering it with `schema_info=RuntimeSchemaInfo(4)` to prevent cache sharing between differing `compute_log_sumexp` values i.e.
```python
@register_op_strategy(aten._scaled_dot_product_efficient_attention.default, schema_info=RuntimeSchemaInfo(4))
def scaled_dot_product_efficient_attention_strategy(
...
```
[As discussed here](https://github.com/pytorch/pytorch/issues/134050#issuecomment-2299887602), since `aten::_scaled_dot_product_flash_attention` could be affected by a similar issue wrt `return_debug_mask`, this PR adjusts the sharding propagation caching policy for that op as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134146
Approved by: https://github.com/tianyu-l
Summary:
This PR updated cuSPARSELt to v0.6.2. I think we should land
https://github.com/pytorch/pytorch/pull/128534 first though.
Most of this PR is just enabling tests to run when cuSPARSELt v0.6.2 is
available.
Unfortunately I was running into a bug with fp32 support on Hopper, so I
removed fp32 support from the cuSPARSELt backend. I think this should be
fine since almost everybody uses the bfloat16/float16/int8 kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134022
Approved by: https://github.com/jerryzh168, https://github.com/malfet
ghstack dependencies: #128534
Summary:
Added support for more custom op input types, now only missing dtype,
layout, memory format as input type, since we need to add some more testing for
mapping the types to their integer values
([previous
comment](https://github.com/pytorch/pytorch/pull/126215#discussion_r1617428066)).
This PR also replaces the `DynamicArg` struct's `serialized_arg_val` with
`list_item_types`, which stores an optional list of strings, where each string
represents the type of the value within this list. This is only used for
parsing lists of optional tensors, where we need to know if a specific value in
the list should be a tensor, or a None. Replacing with a list of strings is
also better than storing the actual json format because then we don't need to
parse the json string during the runtime, and can just loop over a preprocessed
list of strings.
Test Plan: `buck2 run @//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r "test_custom_"`
Reviewed By: desertfire
Differential Revision: D60295995
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132454
Approved by: https://github.com/desertfire
Summary:
We should always emit an end event in a finally block so that if a unit test or job fails, the stack is still correct.
Also, we use thread local storage for the stack, so that in multithreaded scenarios the stack will still be correctly added.
Test Plan:
Run benchmark and see that everything still works
Run
```
TORCH_LOGS=dynamo buck run test/functorch:test_aotdispatch -- -r test_backward_mutation_on_grad_out
```
With some extra logging to see that start events with the correct stack are emitted, and the end events are also emitted even though the test fails at runtime.
Differential Revision: D61682556
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134279
Approved by: https://github.com/aorenste
Fixes #128084
The approach is option 2 of what Elias suggested in the comment
thread:
- We require tensors to have the correct stride at usage. This may
involve a clone; if there was a clone and then a mutation into it
then we copy_ back the result of the mutation.
The reason why I went this approach was because it was the easiest and
Inductor already works really hard to remove additional clones/copy_.
There are some cases that this doesn't generate efficient code for; for
example, if the tensor is a view, we don't change the base of the view
to have the right stride order; instead we do a clone.
The view case isn't very common so I'm ignoring it for now but we could
improve this in the future.
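The clone-then-copy-back idea described above, written out at the Python level purely as a sketch (Inductor applies this at the IR level, not via these exact calls):
```python
import torch

def mutate_with_required_layout(t: torch.Tensor, mutating_op):
    # The op needs a contiguous input; if `t` isn't, clone into the right layout...
    tmp = t.contiguous()
    mutating_op(tmp)          # ...let the op mutate the temporary...
    if tmp is not t:
        t.copy_(tmp)          # ...then copy the mutation back into the original.
    return t

x = torch.randn(8, 16).t()                       # non-contiguous view
mutate_with_required_layout(x, lambda v: v.add_(1))
```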
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133639
Approved by: https://github.com/eellison
Support for effectful operations in backward:
1/ AOTD collects metadata from the forward fn only, so we can have usages of effectful ops in backward that were not used in forward => allowing token discovery during the joint function.
FunctionalTensorMode holds _tokens; in the joint function, after tracing forward, we memoize _tokens as `_tokens_forward_output`.
2/ Tokens are added as primals inputs (forward) in EffectTokensWrapper.
Tokens that will be used in backward are among the partitioner's saved values. We do not have control over the positions at which they are saved in the forward outputs.
2/ If new tokens are discovered in backward after tracing joint_fn, they are manually added at the end of the primals in the resulting graph.
_aot_autograd/utils.py
3/ All effectful ops during backward are marked with the 'must_be_in_backward' partitioner_tag, to prevent the partitioner from placing them in forward.
For that, functional_tensor_mode got a new optional state `self._effects_partitioner_tag` for effectful ops, set after tracing forward.
There are additional changes in the partitioner to improve the functionality of 'must_be_in_backward'.
4/ Unlifting tokens should now run for both forward and backward.
- Since tokens saved for backward are placed at non-static positions, we identify the input and output tokens to erase by the inputs and outputs of the `with_effects` operations.
- In forward we can have input tokens, discovered in backward, that are not used by with_effects ops in forward but are saved for backward. We identify them by their position in the forward inputs.
5/ Add aot debug logging for graphs before unlifting and before adding the additional primals for backward tokens.
Tests:
```
python test/higher_order_ops/test_with_effects.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132638
Approved by: https://github.com/bdhirsh
MSVC doesn't support dynamic arrays (VLAs).
Ref: https://stackoverflow.com/questions/56555406/creating-dynamic-sized-array-using-msvc-c-compiler
We tried several solutions:
1. Use std::vector instead, in a previous PR: https://github.com/pytorch/pytorch/pull/134140, but it changed the variable's type and failed UTs.
2. Use `std::unique_ptr` instead, in PR: https://github.com/pytorch/pytorch/pull/134156; @jansel reviewed and commented: https://github.com/pytorch/pytorch/pull/134156#pullrequestreview-2253091693. That makes sense: allocating memory may make the code run slower.
3. Use a fixed-size array instead, in PR: https://github.com/pytorch/pytorch/pull/134210; a fixed size is hard to handle when the reserved size is smaller than the CPU count.
> a. Limiting with a min() function failed the local test: https://github.com/pytorch/pytorch/pull/134210#issuecomment-2304447729
> b. Dynamically selecting between a fixed-size and a dynamic array: https://github.com/pytorch/pytorch/pull/134210#issuecomment-2304128666. It makes the code too complex to maintain.
Discussed with the original PR (https://github.com/pytorch/pytorch/pull/115620) author @zhuhaozhe; we think:
1. MSVC is the only compiler that does not support VLAs.
2. MSVC has worse performance than other compilers, so use `std::unique_ptr` for MSVC and make it work.
3. For other compilers, keep the current `VLA` code.
4. Windows users can use `clang-cl` or `icx` to get better performance than MSVC.
5. As discussed with @jansel, we need to move the compiler check to the Python side and make the output code cleaner.
Reproduce UT:
```cmd
pytest test/inductor/test_cpu_repro.py -v -k test_reduction_with_dynamic_threads
```
Error msg:
```cmd
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): error C2131: expression did not evaluate to a constant
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): note: failure was caused by a read of a variable outside its lifetime
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): note: see usage of 'max_threads'
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(16): error C3863: array type 'float [max_threads]' is not assignable
```
Genarated code:
```c++
#include "C:/Users/Xuhan/AppData/Local/Temp/tmpt6mxcjzi/j2/cj22tgrdgh42wbunl7gdptg2lintcziox2kmr7rdbcc6n2njrhgx.h"
extern "C" __declspec(dllexport) void kernel(const float* in_ptr0,
const float* in_ptr1,
float* out_ptr0,
float* out_ptr1)
{
{
{
float tmp_acc0 = 0;
at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
int max_threads = omp_get_max_threads();
float tmp_acc0_arr[max_threads];
for (int tid = 0; tid < max_threads; tid++)
{
tmp_acc0_arr[tid] = 0;
}
at::vec::Vectorized<float> tmp_acc0_vec_arr[max_threads];
for (int tid = 0; tid < max_threads; tid++)
{
tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0);
}
#pragma omp parallel
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134221
Approved by: https://github.com/zhuhaozhe, https://github.com/jansel
Summary:
This diff adds two new operators torch.ops._quantized.wrapped_linear_prepack and torch.ops._quantized.wrapped_quantized_linear_prepacked. It is a decomposition of the op torch.ops._quantized.wrapped_quantized_linear added in the previous diff.
We decomposed it this way because the packed weight can be computed early, so we don't need to do it in every forward in AOTI.
Reviewed By: jerryzh168
Differential Revision: D61395887
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134232
Approved by: https://github.com/houseroad
Summary:
As title.
Add a test case in test_aot_inductor to check for codegen (i.e. `aoti_torch_print_tensor_handle` is inserted as expected for debugging printer) for both cpu and cuda based on a simple `addmm` test model.
Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_codegen_abi_compatible_{cuda/cpu}
```
Differential Revision: D61169068
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133326
Approved by: https://github.com/ColinPeppler
Summary: Add tests that check function events for dynamic activity toggling for both GPU and CPU events. Also added comments from previous GH comments
Test Plan: Make sure all tests pass
Differential Revision: D61617514
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134149
Approved by: https://github.com/aaronenyeshi
Summary: Reduce the aarch64 dashboard run to only test the default config, until we solve the timeout issue. Also increase the frequency from nightly to 6 times a day, to see if we can reproduce the perf instability Nikita has observed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134265
Approved by: https://github.com/malfet
Switch installation of the pytorch package to be installed from our download.pytorch.org sources which are better maintained.
As well, switching over the miniconda installation to a miniforge installation in order to ensure backwards compat for users expecting to have the conda package manager installed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134274
Approved by: https://github.com/malfet, https://github.com/atalman
Co-authored-by: atalman <atalman@fb.com>
Summary:
Make quantization tests compatible with the new training IR.
With the new batch norm node `torch.ops.aten.batch_norm.default`, we don't need an additional getitem node after the bn node, so tests need to be fixed to not check for the getitem node.
We added a capture_pre_autograd_graph_using_training_ir() function, which returns True when we are using the training ir, and False otherwise. This way, the code supports both training ir and the old ir.
For now, we are just rolling out the training ir for fbcode internal tests.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_preserve_source_fn_stack
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_update_shared_qspec
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_conv2d
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv_bn_relu_fusion
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv_bn_fusion
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv_bn_fusion_literal_args
```
Reviewed By: andrewor14, tugsbayasgalan
Differential Revision: D61292102
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134259
Approved by: https://github.com/tugsbayasgalan
This patch makes two changes:
1. Whenever ncclCommSplit accepts groupRanks in its config, we should
populate it. This is independent of using PMI or not. For example,
non-PMI NCCL can also use this information, if it chooses to.
2. Provide a user flag to decide when to do a uniqueId broadcast and
when to skip it. This is a performance optimization, and not a
correctness requirement. If the user forgets to set this, we will
do the uniqueId broadcast, which is wasteful (because it will be
ignored by NCCL), but not incorrect.
@exported-using-ghexport
Differential Revision: [D60966774](https://our.internmc.facebook.com/intern/diff/D60966774/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133960
Approved by: https://github.com/shuqiangzhang
Reland of #128143 but added `alpha` and `bias` initialization to `launchTunableGemmAndBias`
Thus far TunableOp was implemented for gemm, bgemm, and scaled_mm. gemm_and_bias was notably missing. This PR closes that gap.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128919
Approved by: https://github.com/malfet
Summary:
In the new training ir, we produce `torch.ops.aten.batch_norm.default` instead of `torch.ops.aten._native_batch_norm_legit.default` or `torch.ops.aten._native_batch_norm_legit_no_training.default`.
So we need to change the pattern match to accommodate the new op.
- Add `torch.ops.aten.batch_norm.default` to pattern matcher list so it's identified as a batch norm node
- `torch.ops.aten.batch_norm.default` doesn't have a getitem user anymore, so when removing the bn node, we need to do `bn_node.replace_all_uses_with(conv_node)` instead of `getitem_node.replace_all_uses_with(conv_node)`
The behavior of capture_pre_autograd_graph is consistent for each run.
If the run is an fbcode test, then capture_pre_autograd_graph uses the training IR. This means both _get_aten_graph_module_for_pattern and replace_pattern_with_filters see the same training IR.
If the run is not a fbcode test, then both would see the old IR.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_conv2d_binary2
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_conv2d_unary
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_linear_unary
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_dynamic_quant_linear
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_dynamic_quant_linear
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_flatten_recipe
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_linear_unary
```
Reviewed By: andrewor14, tugsbayasgalan
Differential Revision: D61291077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134157
Approved by: https://github.com/tugsbayasgalan
Part of #134054.
This corresponds to the pytorch mypy changes from D61493706. Updating takes so
long and touches so many files that it's impossible to land as a whole without conflicting with some other intermediate change.
So we land these 'type: ignore' comments for pytorch in advance of them actually being needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134202
Approved by: https://github.com/Skylion007
Changes:
1. Move `polyfill.py` -> `polyfills/__init__.py`. It can be used as `polyfill.xxx` -> `polyfills.xxx`.
2. Move submodule loading from `polyfills/__init__.py` to `polyfills/loader.py`.
Merge the `polyfill.py` and `polyfills/` packages. Each polyfill module has its own namespace for better code organization.
The ultimate goal is to make `polyfills/__init__.py` empty and move all polyfill functions to their own namespaces.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133977
Approved by: https://github.com/jansel
Summary: When deepcopying a proxy, we first try the default deepcopy behavior.
Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r proxy_deepcopy
Differential Revision: D61398418
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133706
Approved by: https://github.com/angelayi
Summary:
This diff implements a bunch of views for internal scuba viewing.
TODOS that I might punt to another diff:
- Saving cache stats via counter is definitely sus here, but there's not really a good way to track "fx graph cache hit for this compile phase" right now. Will think about this more.
- We should definitely log frame id, compile id, etc
- We should definitely be logging configs. That way, we can A/B test based on whether a config is turned on.
- I don't know what I'm doing with compile_uuid yet, but it's useful when you want to look at samples for a single run. If we had mast job info this field might not be needed, but it's nice to be able to drill down to a single run and get its chrome trace view or icicle view.
Test Plan:
All of the above views are run with nanogpt benchmark:
```
buck run mode/opt caffe2/benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --performance
```
Differential Revision: D61603243
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134118
Approved by: https://github.com/oulgen
As per title, this PR adds proper casting to fuse_linear_bn_weights in the same style as the conv case above. This previously caused numerical issues on my end, so that is why I am fixing it.
Also cleans up the docstring.
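For reference, a minimal sketch of the fused computation with the casting applied (illustrative only; the actual helper is `torch.nn.utils.fusion.fuse_linear_bn_weights` and may differ in details):
```python
import torch

def fuse_linear_bn_weights_sketch(lin_w, lin_b, bn_rm, bn_rv, bn_eps, bn_w, bn_b):
    # Do the math in fp32 and cast back to the original dtype at the end,
    # mirroring the conv fusion path.
    dtype = lin_w.dtype
    bn_scale = bn_w.float() * torch.rsqrt(bn_rv.float() + bn_eps)
    fused_w = lin_w.float() * bn_scale.unsqueeze(-1)
    fused_b = (lin_b.float() - bn_rm.float()) * bn_scale + bn_b.float()
    return fused_w.to(dtype), fused_b.to(dtype)
```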
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134105
Approved by: https://github.com/mikaylagawarecki
Update the cudnn_frontend submodule to 1.6.1 to pick up some minor bug fixes and compiler fixes.
# Bug fixes
* Fixed an issue where custom dropout mask was not correctly applied.
* Added -fvisibility=hidden for the pip wheels generated to avoid symbol conflicts with other modules that use cudnn frontend.
* Fixed an issue in the sdpa operation which, when deserialized, led to numerical mismatches.
* Fixed an issue in sdpa fp8 fprop operation (in inference mode).
# Samples
* Added a new sample to showcase how a custom dropout mask can be applied to a sdpa operation.
* Added a sample to showcase convolutions on large (c * d * h * w > 2 **31) tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134007
Approved by: https://github.com/eqy
Zero bubble can be expressed through `ScheduleFlexibleInterleaved1F1B` by setting `enable_zero_bubble=True`. But instead of having to include this flag in schedule initialization, we should create a separate ZeroBubbleSchedule and also transition `Interleaved1F1B` to derive from `ScheduleFlexibleInterleaved1F1B`. Then we don't need to expose `ScheduleFlexibleInterleaved1F1B`, since its naming is not obvious.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133467
Approved by: https://github.com/wconstab
ghstack dependencies: #132691
Just something I noticed while implementing a new DeviceInterface: I had to add `# type: ignore[assignment]` because mypy thinks DeviceInterface.get_raw_stream is a `Callable` and is therefore incompatible with a `staticmethod`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134187
Approved by: https://github.com/jansel
CUTLASS automatically skips a stage in the epilogue if we provide a nullptr. Thus, instead of building a special kernel for bias=None, we can reuse one of the other ones.
This also considerably simplifies the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134113
Approved by: https://github.com/drisspg
ghstack dependencies: #134110, #134111, #134112
The compute dtype for the bias addition was set to ElementBias. Thus, for a bf16 bias, we would cast the fp32 accum to bf16 and _then_ add the bias. It is however (slightly?) more accurate to first add the bias in fp32 and only cast at the end.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134112
Approved by: https://github.com/drisspg
ghstack dependencies: #134110, #134111
Bug fixes for PyTorch 2.5:
1. Using SYCL group algorithm API instead of old style for sub group shift utilities.
2. Add preprocess in reduction kernel for cases requiring data type cast.
3. Make group norm memory format compatible.
4. ZeroTensor: a. Remove unnecessary aten operator registrations, or the ZeroTensor process is bypassed. b. Align the preprocessing with the in-tree implementation in aten::copy_.
5. Rebase checkIndexTensorTypes usage.
6. Align with the latest semantics of PyTorch foreach operators: return multiple tensors with offset=0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133850
Approved by: https://github.com/EikanWang
As you can see, 'privateuse1' appears many times in out-of-tree extension codebases. I think that everything about the device type should be the same as for other in-tree backends after registering the privateuse1 backend.
For example, after registering a privateuse1 backend named "foo", you should allow "foo" to be passed in as a valid device type.
```diff
- instantiate_device_type_tests(TestIndexing, globals(), only_for='privateuse1')
- instantiate_device_type_tests(NumpyTests, globals(), only_for='privateuse1')
+ instantiate_device_type_tests(TestIndexing, globals(), only_for='foo')
+ instantiate_device_type_tests(NumpyTests, globals(), only_for='foo')
```
> https://github.com/Ascend/pytorch/blob/master/test/test_indexing.py#L1654-L1655
The change is to map privateuse1 backend name to 'privateuse1' when calling `filter_desired_device_types()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133082
Approved by: https://github.com/albanD
Summary:
Previously, reuse of the same `Dim` was encoded by "sharing" internal constraints among constraint targets. This kind of sharing, implemented using `shared` fields between `_Constraint`s, was originally motivated by `dynamic_dim`, specifically to support `==` between `dynamic_dim`s, but we no longer need to maintain this overcomplicated structure: we can simply use names of `Dims` to directly encode sharing information.
Thus this PR vastly simplifies the structure of `_Constraint` by removing `shared` fields. As a result, both `_Constraint` and its moral subclass, `_DerivedConstraint`, are 1-1 with `Dim` and its moral subclass, `DerivedDim`.
Note that this will break `==` over `dynamic_dim`, so an immediate follow-up will be to remove `dynamic_dim` entirely from our public API. (It's been more than 6 months since the deprecation warning anyway.) I just didn't want to deal with that process in the same PR.
Test Plan: existing
Differential Revision: D61559413
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134045
Approved by: https://github.com/pianpwk
Currently, `fully_shard` will create a new `FSDPMyModuleClass` class for each `MyModuleClass` module **object**, which causes Dynamo to guard-fail on every module object's type checking. This PR fixes the issue by caching and reusing previously created FSDP wrapper class.
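A minimal sketch of the caching idea (illustrative only, not the actual FSDP2 code):
```python
_wrapper_cls_cache = {}

def _get_fsdp_wrapper_cls(module_cls):
    # Reuse one generated subclass per original module class so that
    # type(module) stays stable across module instances and Dynamo's
    # type-based guards do not fail on every new module object.
    if module_cls not in _wrapper_cls_cache:
        _wrapper_cls_cache[module_cls] = type(
            f"FSDP{module_cls.__name__}", (module_cls,), {}
        )
    return _wrapper_cls_cache[module_cls]
```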
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134135
Approved by: https://github.com/awgu
Fixes #128084
The approach is option 2 of what Elias suggested in the comment
thread:
- We require tensors to have the correct stride at usage. This may
involve a clone; if there was a clone and then a mutation into it
then we copy_ back the result of the mutation.
The reason I went with this approach is that it was the easiest, and Inductor already works really hard to remove additional clones/copy_.
There are some cases that this doesn't generate efficient code for; for
example, if the tensor is a view, we don't change the base of the view
to have the right stride order, instead we do a clone.
The view case isn't very common so I'm ignoring it for now but we could
improve this in the future.
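As an illustration of the clone/copy_ pattern described above (a hand-written sketch for an op that mutates a contiguous input, not Inductor's actual codegen):
```python
import torch

def call_with_required_layout(op, x):
    # If x is not in the layout the op requires, clone it into that layout,
    # run the mutating op on the clone, and copy_ the result back so the
    # mutation is visible through the original tensor.
    needs_clone = not x.is_contiguous()
    work = x.contiguous() if needs_clone else x
    op(work)
    if needs_clone:
        x.copy_(work)
    return x
```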
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133639
Approved by: https://github.com/eellison
Summary:
This PR adds in cuSPARSELt as a backend to PyTorch.
It is now possible to see if cuSPARSELt is available and the version if
it is with
```
torch.backends.cusparselt.is_available()
torch.backends.cusparselt.version()
```
Test Plan:
```
python test/test_sparse_semi_structured.py -k test_cusparselt_backend
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128534
Approved by: https://github.com/cpuhrsch, https://github.com/eqy, https://github.com/syed-ahmed
As in the title. In addition, the PR introduces `_int_bsr_dense_addmm`, which is equivalent to `bsr_dense_addmm` except that for int8 inputs the result is an int32 tensor (similar to the existing `_int_mm`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133855
Approved by: https://github.com/cpuhrsch
Fixes#133690
The naming was added in #121170 to allow performance debugging of latency critical threads. However the `pt_main_thread` name gets inherited every time a new process or thread is created from the parent one, which defeats the purpose. We need a better way to name the thread that launches kernels on accelerators but for the time being we can let users name the threads in the application code, using: `torch.multiprocessing._set_thread_name("insert_name")`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134066
Approved by: https://github.com/soulitzer, https://github.com/d4l3k
The functorch partitioners use network flow to split the joint graph into a forward and backward graph. Internally, we've found that upgrading to networkx 2.8.8 (from 2.5) results in some hard-to-debug failures (internal reference: https://fburl.com/workplace/jrqwagdm). And I'm told that there's interest to remove the python dependency.
So this PR introduces a C++ implementation that mirrors the API provided by networkx. We'll need to add python bindings and do some additional testing to verify correctness.
Differential Revision: [D61550977](https://our.internmc.facebook.com/intern/diff/D61550977)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132188
Approved by: https://github.com/Chillee
Add a way of generating a FunctionSchema from example values, because a hop's schema varies even for the same hop.
We didn't use torch._C.FunctionSchema because we cannot construct the classes directly (e.g. "__init__" cannot be used for torch._C.FunctionSchema). Also, extending the basic types in C++ does not seem easy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133521
Approved by: https://github.com/zou3519
Summary:
In export, we will generate many redundant getitem nodes branching from the same source, inserted by runtime assertions or any passes. This is causing issues with any downstream system relying on any value being uniquely defined by a single node.
I don't think it hurts to only remove redundant getitem nodes, so I just added the pass to the ctor.
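A minimal sketch of such a dedup pass over an fx.Graph (illustrative; the pass added to the ctor may differ):
```python
import operator
import torch.fx as fx

def dedupe_getitem_nodes(graph: fx.Graph) -> None:
    # Collapse getitem nodes that read the same index of the same source node,
    # so each value is defined by a single node.
    seen = {}
    for node in list(graph.nodes):
        if node.op == "call_function" and node.target is operator.getitem:
            key = (node.args[0], node.args[1])
            if key in seen:
                node.replace_all_uses_with(seen[key])
                graph.erase_node(node)
            else:
                seen[key] = node
```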
Test Plan:
rebase on D61256937
```
buck2 run scripts/bearzx:pt2_export_playground
```
Differential Revision: D61351578
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133618
Approved by: https://github.com/tugsbayasgalan
Add `stage_backward_input` and `stage_backward_weight` functions to perform the weight updates for inputs and weights independently.
We still support the `self.dw_builder` argument for a custom backward, but it has become optional. It takes a separate code path and cannot be used in conjunction with the native zero-bubble backward.
Added tests:
`python test/distributed/pipelining/test_schedule_multiproc.py -k test_schedule_with_native_zero_bubble`
`python test/distributed/pipelining/test_backward.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132691
Approved by: https://github.com/wconstab
**Summary**
Implement the complete vectorization of `index_expr` functionally. We also add heuristics from a performance perspective to resolve the regressions posted below (https://github.com/pytorch/pytorch/pull/122961#issuecomment-2041336265) by disabling vectorization of specific (Fused) scheduler Nodes:
- Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization.
- Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122961
Approved by: https://github.com/jansel
Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
Summary:
This diff adds a new operator wrapped_quantized_linear (torch.ops._quantized.wrapped_quantized_linear), which takes the following input arguments: input (in fp32), input_scale, input_zero_point, weight (in fp32), weight_scale, weight_zero_point, bias (in fp32), output_scale, output_zero_point, and out_channel. It does the following:
1. Use quantize_per_tensor(input, input_scale, input_zero_point) to quantize the input tensor to int8
2. Use quantized::linear_prepack(weight, weight_scale, weight_zero_point, bias) to pack the weight and bias
3. Use quantized::linear to perform int8 quantized linear
4. Dequantize the result
This new op is essentially a wrapper of multiple ops. We do this because torch.export cannot handle models that use the old quantize APIs.
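A Python sketch of that composition (illustrative only; the actual op is implemented in C++, and the out_channel argument is omitted here):
```python
import torch

def wrapped_quantized_linear_sketch(
    x, x_scale, x_zero_point,
    w, w_scale, w_zero_point,
    bias, out_scale, out_zero_point,
):
    # 1. Quantize the fp32 input to int8 (quint8 activation).
    qx = torch.quantize_per_tensor(x, x_scale, x_zero_point, torch.quint8)
    # 2. Quantize the fp32 weight (qint8) and prepack it with the bias.
    qw = torch.quantize_per_tensor(w, w_scale, w_zero_point, torch.qint8)
    packed = torch.ops.quantized.linear_prepack(qw, bias)
    # 3. Run the int8 quantized linear.
    qy = torch.ops.quantized.linear(qx, packed, out_scale, out_zero_point)
    # 4. Dequantize the result back to fp32.
    return qy.dequantize()
```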
Reviewed By: jerryzh168
Differential Revision: D61377266
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134024
Approved by: https://github.com/houseroad
Add the decorator `torch.compiler.substitute_in_graph` to register polyfills for unsupported C++ functions and avoid graph breaks. This API provides an official way to add Dynamo support for third-party C extensions. It can also be used to simplify our implementation of `torch._dynamo.polyfill`.
5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)
Example:
```python
>>> import operator
>>> operator.indexOf([1, 2, 3, 4, 5], 3)
2
>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
Unsupported: ...
>>> @torch.compiler.substitute_in_graph(operator.indexOf)
... def indexOf(sequence, x):
...     for i, item in enumerate(sequence):
...         if item is x or item == x:
...             return i
...     raise ValueError("sequence.index(x): x not in sequence")
>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712
Approved by: https://github.com/jansel
Summary:
* TLDR:
`getenv` is not thread safe w.r.t. `setenv`. Environment variables are kept as a per-process "dictionary" by libc. `setenv` can essentially realloc the whole thing and move this list to a completely different location. If there is a concurrent `getenv` happening at the same time, it is possible that it ends up reading stale memory and segfaulting.
`getenv` is thread safe w.r.t other `getenv`.
* Details:
Inside PTD init:
```
ProcessGroupNCCL ctor
...
ncclCommWatchdogThread_ =
std::thread(&ProcessGroupNCCL::ncclCommWatchdog, this); (https://fburl.com/code/terf9ai7)
```
Inside ncclCommWatchdog thread:
```
...
ncclHeartbeatMonitorThread_ =
std::thread(&ProcessGroupNCCL::heartbeatMonitor, this); (https://fburl.com/code/fv9camg2)
...
```
Inside heartbeatMonitor thread:
```
...
std::optional<DumpPipe> dumpPipe = std::nullopt; (https://fburl.com/code/qdvahzbu)
dumpPipe.emplace(rank_);
...
```
Inside DumpPipe ctor (https://fburl.com/code/wvixlqcz)
```
getCvarString
getenv <=== SIGSEGV
```
On the main thread:
We go on to initialize NCCL:
Inside getNCCLComm, we call: `getNcclVersion` -> `initEnv` (https://fburl.com/code/j312pccu)
`initEnv` inside NCCL does this: `initEnv` -> `setEnvFile`
It reads the /etc/nccl.conf file and sets env variable values with `setenv` (https://fburl.com/code/cq4r0y0h).
This `setenv` can race with the `getenv` in the heartbeatMonitor thread.
Ideally, all `setenv` calls should be done by a single thread before launching other threads. This diff moves getNcclVersion before launching the watchdog thread to make sure all setenv calls are done beforehand.
I think we are just getting lucky that we are not hitting it in production. IIRC we did see a getenv segfault once in one of the large-scale runs, but I don't remember the details now.
Test Plan: A lot of testing done as part of D61411062 & CI
Differential Revision: D61421292
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133744
Approved by: https://github.com/wconstab, https://github.com/fduwjj
Summary:
Change ReorderConvertTest to work with the new `capture_pre_autograd_graph` implementation using D61175223.
Note that now `ReorderConvertTest` doesn't work with the old `capture_pre_autograd_graph` anymore.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/passes/tests:optimize_test -- -r ReorderConvertTest
```
Differential Revision: D61507772
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134010
Approved by: https://github.com/tugsbayasgalan
Link various classes and functions of `optim.swa_utils` to make the doc content accessible from the `torch.optim` doc.
Currently, if you click the link
https://pytorch.org/docs/stable/optim.html#module-torch.optim.swa_utils, it goes to a blank section at the bottom of the `torch.optim` page.
Also, the `torch.optim.swa_utils.AveragedModel` and `torch.optim.swa_utils.SWALR` classes, as well as `torch.optim.swa_utils.update_bn()` and `optim.swa_utils.get_ema_multi_avg_fn`, are not linked in the docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133393
Approved by: https://github.com/janeyx99
https://github.com/pytorch/pytorch/pull/132990 introduced dependency on `torch.version`, which might not be imported yet, and can result in `AttributeError: partially initialized module 'torch' has no attribute 'version' (most likely due to a circular import)` if user starts its code with `import torch.cuda`
Fix it by importing `torch.version` explicitly
Test Plan: CI
Differential Revision: D61549284
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134019
Approved by: https://github.com/seemethere
Summary:
Skip re-exporting modules with duplicated types to speed up the exportability tests.
In real models there are many duplicated modules, and they mostly have the same export issues.
Test Plan: Existing CI
Differential Revision: D61504630
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133930
Approved by: https://github.com/angelayi
Add the decorator `torch.compiler.substitute_in_graph` to register polyfills for unsupported C++ functions and avoid graph breaks. This API provides an official way to add Dynamo support for third-party C extensions. It can also be used to simplify our implementation of `torch._dynamo.polyfill`.
5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)
Example:
```python
>>> import operator
>>> operator.indexOf([1, 2, 3, 4, 5], 3)
2
>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
Unsupported: ...
>>> @torch.compiler.substitute_in_graph(operator.indexOf)
... def indexOf(sequence, x):
...     for i, item in enumerate(sequence):
...         if item is x or item == x:
...             return i
...     raise ValueError("sequence.index(x): x not in sequence")
>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712
Approved by: https://github.com/jansel
```
# suppose we have a 3d mesh
mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"))
dp_cp_mesh = mesh_3d["dp", "cp"]._flatten()
"""
then we would have
flatten_name_to_root_dims[mesh_3d]: {
"dp_cp": (0, 1)
}
"""
```
We need this information to validate the ordering of a mesh slice that includes a flattened mesh dim.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133838
Approved by: https://github.com/fegin
Summary:
Skip re-exporting modules with duplicated types to speed up the exportability tests.
In real models there are many duplicated modules, and they mostly have the same export issues.
Test Plan: Existing CI
Differential Revision: D61504630
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133930
Approved by: https://github.com/angelayi
Co-authored-by: bearzx <bearzx@fb.com>
This is a parallel PR to https://github.com/pytorch/pytorch/pull/133819, and it appends changes for @jansel's comments.
1. For `torch/_inductor/codegen/cpp_wrapper_cpu.py`, revert to the original code that appends LL on macOS and Windows: bdc14ad89a
2. For `torch/_inductor/codegen/cpp_utils.py`, append LL on macOS and Windows for large constants, and fix its UTs: 3a56b76ce0
------------------------------
Another solution for https://github.com/pytorch/pytorch/pull/133615: use `int64_t` as the index type on all platforms.
### Development notes:
The mentioned PR (https://github.com/pytorch/pytorch/pull/133615) fixes the index type not matching the parse_arg argument types. As reviewed with @jansel, Jason thinks we need to unify `INDEX_TYPE` across all platforms.
The current code is cumbersome:
```python
INDEX_TYPE = "int64_t" if _IS_WINDOWS else "long"
```
So, I made some attempts to unify `INDEX_TYPE` as either `long` or `int64_t`.
For using `long` as the index type: https://github.com/pytorch/pytorch/pull/133768
For using `int64_t` as the index type: https://github.com/pytorch/pytorch/pull/133782
After that, we discussed which type to select as the final solution.
The `long` type has different definitions and sizes across OSs and compilers. So, @jansel made the decision that we need to select `int64_t` for all platforms, and I continued my work based on https://github.com/pytorch/pytorch/pull/133782.
https://github.com/pytorch/pytorch/pull/133782 still had two issues:
1. std::min/std::max could not match function overloads by argument types. This was fixed and validated in PR https://github.com/pytorch/pytorch/pull/133812.
2. A CUDA TestMemoryPlanning::test_cpp_wrapper issue caused by the wrong index type. It is fixed in this PR.
So, we land the final solution in this PR.
### Changes:
**1. Use `int64_t` type as index type for all OSs: `Windows`, `Linux` and `MacOS`.**
**2. Use `static_cast<int64_t>(constant)` to convert constants passed to `div_floor_integer` to its argument type (`int64_t`).**
**3. Update the `parse_arg` function signature to `int64_t`, which follows the index type.**
**4. Append a double L (`LL`) suffix to constants on Windows and macOS, because their int64_t is long long.**
**5. Fix `std::min/std::max` type mismatches by static_cast to `INDEX_TYPE`.**
**6. Fix UTs, including: cuda `TestMemoryPlanning::test_cpp_wrapper` and `test_indexing.py`.**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133892
Approved by: https://github.com/jansel
Another attempt to update NVTX to NVTX3. We now avoid changing the NVTX header inclusion of existing code. The advantage of NVTX3 over NVTX is that it is a header-only library, so linking with NVTX3 can greatly simplify our CMake and other build scripts for finding libraries in user environments. In addition, NVTX is indeed still present in the latest CUDA versions, but it's no longer a compiled library: it's now header-only. That's why there isn't a .lib file anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109843
Approved by: https://github.com/peterbell10, https://github.com/eqy
Co-authored-by: Ivan Zaitsev <108101595+izaitsevfb@users.noreply.github.com>
Summary:
- exir.capture + to_edge is deprecated. We need to use export + to_edge.
- Fix the quantization pass to be compatible with the new export IR. In the quantization pass, some nodes might have side effects, so they don't have users but still are not removed by the DCE pass. We need to account for that.
- Now export_rle_model works with the default `capture_pre_autograd_graph`; it should also work with the new training IR.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/export:export_rle_model -- -r export_rle_model
```
Differential Revision: D61485834
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133937
Approved by: https://github.com/tugsbayasgalan
Summary:
The existing tests didn't cover a case where we had multiple autotunes in a single graph. Add a test to demonstrate that case.
Also added a test dependency on redis and removed the "fake redis" from the previous PR (#133579)
Test Plan: unit tests
Reviewed By: oulgen
Differential Revision: D61178861
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133868
Approved by: https://github.com/oulgen
Adds guards checking whether torch function mode is in the all disabled state.
There are three torch function enablement states:
* All torch function disabled (modes + subclasses)
* Torch function subclass disabled
* All enabled
We now have guards checking if the state is All enabled and if the state is All disabled.
Each of the above three states maps to a unique pair of these two flags.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133135
Approved by: https://github.com/anijain2305
ghstack dependencies: #133130, #133729, #133131, #133132, #133133, #133134, #133136
This PR adds a C function to check if all torch function is disabled.
Recall that there are three torch function enablement states:
* All disabled
* Torch Function Subclass disabled
* All enabled
The API before this change provides two functions:
* `_is_torch_function_enabled` - returns True iff the current TF state is All enabled
* `_is_torch_function_mode_enabled` - returns True iff the state is not All disabled and the torch function mode stack is non-empty.
The crux of why a new API is needed is the following: if dynamo enters a frame with the torch function mode stack empty and `_is_torch_function_enabled` == False, it is impossible to determine, after a new mode is pushed, whether we should enter the mode or not. This is because we don't know if the enablement state is All disabled or only Subclass disabled. Adding this API to check whether All disabled is True allows us to disambiguate this case.
In the next PR, Dynamo InstructionTranslator will have clearer flags than the underlying C API:
* A flag to indicate if subclasses are disabled (ie All disabled or Subclass Disabled is the current state)
* A flag to indicate if modes are disabled (ie if All disabled is the current state)
* A symbolic stack which can be checked if any modes are present
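For illustration, a sketch of how the three states can be distinguished once the new check exists (the name of the new "all disabled" binding is assumed here, not taken from this PR's text):
```python
import torch

def torch_function_enablement_state():
    if torch._C._is_torch_function_enabled():
        return "all enabled"
    # Assumed name for the new "all disabled" check described above.
    if torch._C._is_torch_function_all_disabled():
        return "all disabled (modes + subclasses)"
    return "only subclasses disabled"
```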
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133136
Approved by: https://github.com/bdhirsh
ghstack dependencies: #133130, #133729, #133131, #133132, #133133, #133134
This PR adds support for `torch._C._push_on_torch_function_stack()` by updating `torch.py` to push onto the symbolic torch function mode stack when a push is encountered. The same side effects infra used in the previous PR is used to track mutation of the torch function mode stack and add bytecode to update it if it is mutated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133132
Approved by: https://github.com/williamwen42
ghstack dependencies: #133130, #133729, #133131
This PR adds support for tracing `torch._C._pop_torch_function_stack()` without graph breaking and in order to verify the state change also adds replay of mutations to the torch function mode stack via side_effects appending supplemental bytecode as we do for other python mutable objects.
Details:
To represent the torch function mode stack symbolically a deque field is added to the instruction translator. When the InstructionTranslator is initialized, all modes are read from the current torch function mode stack, and stashed in a global weak ref for later access (using existing sources) without needing to push/pop the python/cpp torch function mode stack.
During tracing, when `_pop_torch_function_stack` is encountered a value is popped from this deque and the variable tracker representing the mode is returned. To ensure the true torch function mode stack matches this state, `TorchFunctionModeStackVariable`, a singleton, is marked as mutated, this adds it to side effects, where during final codegen, side effects will codegen a call to a python helper which will update the python torch function mode stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133131
Approved by: https://github.com/jansel
ghstack dependencies: #133130, #133729
This PR adds a guard on the torch function mode stack state at the beginning of tracing. The way this is implemented is via a new leaf guard which is passed the initial stack state at construction and compares it to the stack state at the time the guard is run.
Details:
The stack state is extracted via popping all modes, appending them to a list, and pushing all modes back. This list is stored on the output graph and read during guard construction to pass to the stack mode guard. There the length and types of the modes are recorded. Next time the guard is run it compares this recorded state to the current mode stack state.
To implement this in python, a helper function was added to utils.py; it is used if cpp guards are not enabled.
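A minimal Python sketch of that extraction (the length helper name is an assumption; the actual helper in utils.py may differ):
```python
import torch

def snapshot_torch_function_mode_stack():
    # Pop every mode off the torch function mode stack, record it, then push
    # everything back so the real stack is left unchanged.
    popped = []
    while torch._C._len_torch_function_stack() > 0:  # assumed length helper
        popped.append(torch._C._pop_torch_function_stack())
    for mode in reversed(popped):
        torch._C._push_on_torch_function_stack(mode)
    # popped is top-to-bottom; reverse it to report bottom-to-top order.
    return list(reversed(popped))
```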
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133130
Approved by: https://github.com/anijain2305
Summary: Defaulting TORCH_NCCL_DUMP_ON_TIMEOUT to "true" and adding a killswitch in case we need to kill this feature in production.
Test Plan: Tests pass manually but need further testing before this is rolled out fully everywhere.
Differential Revision: D61136320
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133237
Approved by: https://github.com/c00w
Summary: This diff fixed many lint issues in qlinear_prepack.cpp. I'm fixing them as I want to add more ops/funcs into this file later.
Test Plan: Sandcastle
Differential Revision: D61425436
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133797
Approved by: https://github.com/Skylion007
Summary: `_ConstraintTarget` is an internal data structure that has some redundancy: tensors are identified by their id but also carry a weak reference. The weak reference was probably useful a year back but everything is done with ids right now, and the lifetime of these tensors ensures that using their ids is OK.
Test Plan: existing tests
Differential Revision: D61488816
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133890
Approved by: https://github.com/tugsbayasgalan
Summary: When generating CUDA kernel load and launch, certain Triton kernel meta data are needed, but those meta data only exist after kernel auto-tune is done. DeferredCudaKernelLine is a deferred line which can backfill a string template after kernel auto-tune. This is to prepare for one-pass AOTI codegen implementation.
Differential Revision: [D61018114](https://our.internmc.facebook.com/intern/diff/D61018114)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129135
Approved by: https://github.com/angelayi
Summary:
Remove the early exit for padding when padding = [0, 0, 0, 0].
This prevents export from specializing when all padding=0, allowing export when all padding >= 0. Specialization will still happen for negative padding.
This change will be used to export image preprocess for multimodal models, where images of dynamic shape are padded. As images are of dynamic shape, we can't be sure if padding will be required or not. Padding is guaranteed to be non-negative.
Preprocess code: https://github.com/pytorch/torchtune/pull/1242
Note: the alternative is to wrap padding in a custom op, which isn't ideal given the custom op will contain the same impl as constant_pad_nd.
Test Plan: ci
Differential Revision: D60687727
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132679
Approved by: https://github.com/ezyang
The regex in the script is too restrictive, as it excludes examples with parentheses in args, like the following:
```
triton_poi_fused_add_0.run(arg0_1.item(), arg1_1.item(), buf0, 1, grid=grid(1), stream=streamNone)
^
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130837
Approved by: https://github.com/Chillee
Fixes the observed graph breaks in https://github.com/pytorch/pytorch/issues/121349 and https://github.com/pytorch/pytorch/issues/121350.
But there are still graph breaks since a random output is being used as a seed, e.g.
```python
import random
import torch
def fn(x):
    seed = random.randint(0, 100)
    rand = random.Random(seed)
    return x + rand.randrange(10)
opt_fn = torch.compile(fn, backend="eager", fullgraph=True)
opt_fn(torch.ones(1))
```
fails with
```
torch._dynamo.exc.InternalTorchDynamoError: UnspecializedPythonVariable() is not a constant
```
when tracing the line
```
rand = random.Random(seed)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133725
Approved by: https://github.com/jansel
Add the decorator `torch.compiler.substitute_in_graph` to register polyfills for unsupported C++ functions and avoid graph breaks. This API provides an official way to add Dynamo support for third-party C extensions. It can also be used to simplify our implementation of `torch._dynamo.polyfill`.
5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)
Example:
```python
>>> import operator
>>> operator.indexOf([1, 2, 3, 4, 5], 3)
2
>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
Unsupported: ...
>>> @torch.compiler.substitute_in_graph(operator.indexOf)
... def indexOf(sequence, x):
...     for i, item in enumerate(sequence):
...         if item is x or item == x:
...             return i
...     raise ValueError("sequence.index(x): x not in sequence")
>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712
Approved by: https://github.com/jansel
# UPDATE:
This is take 3 of https://github.com/pytorch/pytorch/pull/131863, which was landed via co-dev but did not apply correctly
# Summary
Changes the stance of SDPA on what to do for fully masked out rows
## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963
These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here:
https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617
can be paraphrased as follows:
When passing in fully masked out rows, attention becomes ambiguous. We have two main options:
1. Uniformly attend to all values:
```python
scores[masked_out_rows] = 1 / len(row)
out[masked_out_rows] = 1 / len(row) * value
```
2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
```python
output[fully_masked_rows] = NaN
```
We went with option 2, partly because it was easier to implement, but also because people argued that users can slice the output to remove the NaNs:
``` Python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happens when you call backward...
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
[ nan, nan, nan, nan]])
```
Those pesky NaNs are back!
## Why do we see NaNs today?
The core of the problem revolves around using softmax function in sdpa:
```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```
## Quick Aside: Masking in Attention
Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs.
We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.
## Alternative Approaches
If we use a very large negative number instead of -inf:
```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
However if users always remembered to "slice" out their outputs i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564, 0.1613, -0.0486],
[ 0.0000, 0.0000, 0.0000, 0.0000]])
```
This would bring us back into a better state.
## A Third Option
We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.
This PR implements the new semantic for masking w/ attention in fully masked-out rows:
```python
out[masked_out_rows] = 0
```
**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.
## Details
This PR stack does 3 things:
1. Adds a PRIVATE _safe_softmax op
2. Updates the semantics for the flash_cpu fused kernel
3. Updates the semantics for the efficient_cuda fused kernel
_safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Due to this fact, instead of decomposing softmax and checking for -inf rows, we "cheat" and use nan_to_num.
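A rough sketch of the trick (not the actual `_safe_softmax` kernel):
```python
import torch

def safe_softmax_sketch(scores, dim=-1):
    # A fully -inf (fully masked) row produces NaN in every softmax entry;
    # map those NaNs to 0 so fully masked rows yield zero output.
    return torch.nan_to_num(torch.softmax(scores, dim=dim), nan=0.0)
```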
Why I think this is okay? (please find a counter point if avail)
There are multiple ways NaNs can emerge. For the fully masked out rows case nan_to_num works. But what if there were other NaNs, wouldn't this silently remove them?
The only case that this can happen is if the input itself had a NaN or an Inf
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
Will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`
Where
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`
If we don't want to even allow the possibility of "inf" or "NaN" attention scores being converted to 0, then we can implement it something like this:
```Python
max = torch.max(a, dim=-1, keepdim=True)
exp = torch.exp(a - max.values)
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
softmax = torch.where(max.values == float('-inf'), 0.0, softmax)
```
however we would be paying for this in math performance.
## Why Now
I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic.
Differential Revision: [D61418679](https://our.internmc.facebook.com/intern/diff/D61418679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133882
Approved by: https://github.com/soulitzer
When exporting a training model for Executorch (which requires all ops to be core aten) with cross entropy loss (`torch.nn.CrossEntropyLoss`), we ran into the following error from the fx verifier in `to_edge`:
```
torch._export.verifier.SpecViolationError: Operator torch._ops.aten.nll_loss2d_forward.default is not Aten Canonical.
```
The aten [implementation](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/LossNLL.cpp#L624) of `torch.nn.CrossEntropyLoss` uses `nll_loss2d_forward` for inference and `nll_loss2d_backward` for training, so we need to add the decompositions for both (which already exist) to the list of core aten decompositions.
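A minimal sketch of the export flow this unblocks (inference-side only; the Executorch `to_edge` and training-specific steps are omitted):
```python
import torch

class LossModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.loss = torch.nn.CrossEntropyLoss()

    def forward(self, logits, target):
        return self.loss(logits, target)

logits = torch.randn(2, 3, 8, 8)          # (N, C, H, W) input hits the 2d nll_loss path
target = torch.randint(0, 3, (2, 8, 8))
ep = torch.export.export(LossModule(), (logits, target))
# With nll_loss2d_forward/backward in the core ATen decomposition table,
# running decompositions leaves only core ATen ops in the graph.
ep = ep.run_decompositions()
```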
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133534
Approved by: https://github.com/JacobSzwejbka
## Description
Create decomposition of _unsafe_index_put (non-core aten) that turns it into index_put (core aten)
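A sketch of the shape such a decomposition can take (illustrative; the registered decomposition in `torch/_decomp` may differ):
```python
import torch
from typing import List, Optional
from torch import Tensor

def unsafe_index_put_decomp(
    x: Tensor, indices: List[Optional[Tensor]], values: Tensor, accumulate: bool = False
) -> Tensor:
    # _unsafe_index_put is index_put without bounds checking, so functionally
    # it can be expressed directly in terms of the core op.
    return torch.ops.aten.index_put(x, indices, values, accumulate)
```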
## Testing
Phi3 mini + LoRA model successfully passed `to_edge` after failing due to a non-core aten `unsafe_index_put` getting introduced in a decomposition during joint graph calculations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133365
Approved by: https://github.com/pianpwk
Summary: Migrate to aten IR, `reshape` -> `view.default`. Not covering `flatten` as there are already optimizations done in PT2; see the example here: P1506057533
Differential Revision: D60476525
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132183
Approved by: https://github.com/frank-wei
Creates a new runtime that shifts complexity from runtime to
ahead-of-time.
The existing runtime (PipelineScheduleMulti) accepts a compute-only schedule (only forward, backward, and weight actions are specified) and infers the communication operations at runtime.
Compared to that runtime, PipelineScheduleRuntime has less logic that
happens at runtime and relies on lowering passes to transform the
compute-only schedule to add communications.
Advantages include
- easier to verify the correctness by dumping a compute+comm schedule
- possible to manually edit the compute+comm schedule if the lowering
heuristics are insufficient
Functionality included inside the PipelineScheduleRuntime is limited to
- accepting a compute-only schedule and lowering it to add comms
- executing the compute or comm operations specified by the given
schedule
- handling work.wait() automatically by calling it just before the
matching compute operation (for RECV ops) or at the end of step (for
SEND ops)
Follow ups for later PRs
- Some refactoring should be done to replace PipelineScheduleMulti with
this runtime
- Optimizer execution is not considered (e.g. for zero-bubble cases)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130488
Approved by: https://github.com/H-Huang
Summary: Previously we were mocking out FbRemoteFxGraphCacheBackend, which meant that we were not testing a whole bunch of the cache code. Mock the cache at a lower level (CacheClient, LocalAutotuneCacheBackend, ManifoldClient, Redis) so we cover a larger amount of the caching code.
Test Plan: unit tests
Reviewed By: oulgen
Differential Revision: D60937966
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133579
Approved by: https://github.com/oulgen
This is the first step to make sure we have a basically functional analyzer for FR in production.
- We want to use this script to find abnormalities in collectives and report them to users.
- We also fixed some type errors.
- [Ongoing] We will also add more unit tests to this script and modularize it so that we can better maintain it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133412
Approved by: https://github.com/c-p-i-o, https://github.com/atalman
This is a bugfix for an issue recently encountered in ROCm/DeepSpeed. Currently, if a library installs pynvml and runs on ROCm, pytorch will break, as _HAS_PYNVML is set to true and it will attempt to use the amdsmi library for the device_count call, which will not be installed.
This fix will set _HAS_PYNVML to false on ROCm if amdsmi is not installed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132990
Approved by: https://github.com/pruthvistony, https://github.com/eqy, https://github.com/malfet
This fixes an issue on AArch64 CPUs supporting BF16, where torch.set_float32_matmul_precision("highest") did not disable the bf16 downconversion in mkldnn_matmul.
This was discovered from a unit test failure where the decorator `torch.testing._internal.common_mkldnn.bf32_on_and_off`, which internally switches the float32_matmul_precision between "medium" and "highest" was not having the desired effect.
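A small illustration of the setting in question (a sketch of the expected behavior, not a test from this PR):
```python
import torch

# On an AArch64 machine with BF16 support, "highest" should force full fp32
# matmuls, i.e. no implicit bf16 downconversion in mkldnn_matmul.
torch.set_float32_matmul_precision("highest")
a = torch.randn(128, 128)
b = torch.randn(128, 128)
c = a @ b
```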
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130919
Approved by: https://github.com/jgong5
Upgrades the LF scale configs to change the default AMI in accordance with the Amazon 2023 rollout plan.
This PR will be merged on Monday Aug 19 in the morning, and over the next 2-3 days as new linux runners are spun up (and old ones spun down) they'll start using this new AMI
This PR will be paired with https://github.com/pytorch/test-infra/pull/5558, which will be merged after this one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133641
Approved by: https://github.com/jeanschmidt
FIXES https://github.com/pytorch/pytorch/issues/123949 and https://github.com/pytorch/pytorch/issues/124376
torch.cuda.memory_allocated returns the amount of memory allocated in the current process, so if it isn't 0 it means another test didn't properly clean up after itself. I'm keeping the memory check and isolating these tests in a subprocess, as we don't have a good way to test for activation refcounts
e.g. https://github.com/pytorch/pytorch/runs/28838386083
```
_______________ TestCompiledAutograd.test_free_activation_memory _______________
Traceback (most recent call last):
File "/var/lib/jenkins/workspace/test/inductor/test_compiled_autograd.py", line 1892, in test_free_activation_memory
self.assertTrue(torch.cuda.memory_allocated() == 0)
File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
raise self.failureException(msg)
AssertionError: False is not true
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133733
Approved by: https://github.com/jansel
This threads through all of the necessary parts into aot autograd from the FXGraphCache changes so that we can run cudagraphs properly on an AOTAutograd cache hit.
Specifics:
- AOTAutograd needs access to the `cudagraphs` boxedbool in order to properly set the backward to not use cudagraphs on a cache hit from the forward.
- We have lots of tests that test this already from the previous PR, so I just added an extra test and made the previous test work with both AOTAutogradCache and FXGraphCache at the same time.
```
TORCH_LOGS=torch._functorch._aot_autograd.autograd_cache,cudagraphs ENABLE_AOT_AUTOGRAD_CACHE=1 TORCHINDUCTOR_FX_GRAPH_CACHE=1 tlp python benchmarks/gpt_fast/benchmark.py --output ~/gpt_fast_benchmark.csv
```
This was run twice, once on a cache miss and once on a cache hit.
Here is the perfetto trace for each(FB only link):
**Cache Miss:**
Logs:
```
Loading model Llama-2-7b-chat-hf
Time to load model: 0.66 seconds
I0813 10:53:34.416000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [0/0] AOTAutograd cache miss for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey
I0813 10:53:51.395000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [0/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey/entry
I0813 10:54:17.579000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [1/0] AOTAutograd cache miss for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt
I0813 10:54:38.636000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [1/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt/entry
I0813 10:54:39.228000 911030 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints
V0813 10:54:39.939000 911030 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0
V0813 10:55:10.615000 911030 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0
Compilation time: 101.24 seconds
Average tokens/sec: 147.96 tokens/sec
Average bandwidth achieved: 1955.22 GB/s
Memory used: 14.51 GB
```
Chromium Event(fb only):
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key

**Cache Hit:**
Logs:
```
Loading model Llama-2-7b-chat-hf
Time to load model: 0.67 seconds
I0813 10:55:51.821000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [0/0] AOTAutograd cache hit for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey
I0813 10:55:55.465000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [1/0] AOTAutograd cache hit for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt
I0813 10:55:56.030000 944420 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints
V0813 10:55:56.192000 944420 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0
V0813 10:55:56.426000 944420 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0
Compilation time: 9.40 seconds
Average tokens/sec: 147.94 tokens/sec
Average bandwidth achieved: 1954.98 GB/s
Memory used: 14.51 GB
```
Chromium Event(fb only):
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json&local_cache_key

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132294
Approved by: https://github.com/eellison
**Summary**
Implement the complete vectorization of `index_expr` functionally. We also add heuristics from a performance perspective to resolve the regressions posted below (https://github.com/pytorch/pytorch/pull/122961#issuecomment-2041336265) by disabling vectorization of specific (Fused) scheduler Nodes:
- Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization.
- Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122961
Approved by: https://github.com/jansel
Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
Summary:
# context
* when running an IG FM training with PT2, we found there are a few graph breaks due to a torch.diff call in [jagged_tensor.py](https://fburl.com/code/cwssxabc)
```
_length: List[int] = (
    _length_per_key_from_stride_per_key(torch.diff(offsets), stride_per_key)
    if variable_stride_per_key
    else torch.sum(torch.diff(offsets).view(-1, stride), dim=1).tolist()
)
```
* looking into the failure, we found the TORCH_CHECK in diff should be TORCH_SYM_CHECK
* slice_forward error: df3d7729e, [tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpxXZ2em/index.html)
```
RestartAnalysis
Tried to use data-dependent value in the subsequent computation. This can happen when we encounter unbounded dynamic value that is unknown during tracing time. You will need to explicitly give hint to the compiler. Please take a look at torch._check OR torch._check_is_size APIs. Could not guard on data-dependent expression ((5*u37 + u38)//(u37 + u38)) < 0 (unhinted: ((5*u37 + u38)//(u37 + u38)) < 0). (Size-like symbols: u38, u37)
ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False.
Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance.
Potential framework code culprit (scroll up for full backtrace):
File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/e99934938a0abe90/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_decomp/decompositions.py", line 771, in slice_forward
if end_val < 0:
```
* after this diff: [tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpAhv2Sh/failures_and_restarts.html)
Test Plan:
# command
* run model
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2
```
* generate tlparse
```
tlparse `ls -t /var/tmp/tt/* | head -1`
```
Reviewed By: ezyang
Differential Revision: D56339251
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133740
Approved by: https://github.com/ezyang
Moving DTensor to be in the public namespace, to formally add the
documentation page that includes all the public APIs. This includes:
* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next
PRs)
* To preserve BC for users still using `torch.distributed._tensor`, I added a shim script to redirect old path calls to the new module
The BC preservation is evidenced by the fact that all DTensor tests still work without changing the public imports, so it's safe to land the changes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
Summary:
These tests aren't running internally because the outer test harness is crashing without listing the tests. To fix we need:
* Add a target for the tools/stats/ folder since this test imports it
* Add a dependency on that target so it's included in the par
* Fix up the relative import syntax, which is somehow different internally vs. fbcode (not sure why this works, but many other tests are doing it)
Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cudagraph_trees_expandable_segments -- --run-disabled`
Differential Revision: D61396711
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133698
Approved by: https://github.com/xuzhao9
**Summary**
After enabling more vectorization, we found that vectorization does not always bring performance benefits. For example, a kernel with several non-contiguous index computations or non-contiguous buffer load/store operations can experience a performance regression. A typical case is what we observed in the next PR: after fully enabling vectorization of `index_expr`, we saw a performance regression in `hf_BigBird`.
In this PR, we refactor tiling selection into a standalone module to make it extensible for more advanced tiling selection heuristics. A standalone class `TilingSelect` with a `select_tiling` method has been added. `select_tiling` accepts `fn_list` and `var_sizes_list` as inputs and returns `tiling_factors` and `tiling_indices`.
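A minimal interface sketch of the class described above (the heuristic body is a made-up placeholder, not Inductor's actual implementation; each entry of `var_sizes_list` is treated here as a flat tuple of loop sizes):
```python
class TilingSelect:
    """Standalone tiling selection (interface sketch only)."""

    def select_tiling(self, fn_list, var_sizes_list):
        # Placeholder heuristic: only tile when every kernel body shares the
        # same iteration sizes; tile the innermost loop by a factor of 16.
        if not var_sizes_list or any(vs != var_sizes_list[0] for vs in var_sizes_list):
            return [], []
        tiling_factors = [16]
        tiling_indices = [len(var_sizes_list[0]) - 1]
        return tiling_factors, tiling_indices


# Example: two kernels iterating over the same (128, 64) sizes.
factors, indices = TilingSelect().select_tiling(
    fn_list=[lambda: None, lambda: None],
    var_sizes_list=[(128, 64), (128, 64)],
)
print(factors, indices)  # [16] [1]
```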
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130892
Approved by: https://github.com/jgong5
Summary:
This diff aims to fix the GPU Test skips in the quantization tests under the `caffe2/test/quantization` directory. The changes made in the `TARGETS` files include adding the `should_use_remote_gpu` flag to enable remote GPU testing. This should help to resolve the skipped tests and improve the overall test coverage.
[This diff] Fixed skip count: 4
[Running total] Fixed skip count: 4
Note: Creating separate diffs for each test-group.
Test Plan:
**281475054644766**: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_compare_per_channel_device_numerics (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'
https://www.internalfb.com/intern/testinfra/testrun/5629499773981783
**281475054644780**: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_compare_per_tensor_device_numerics (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'
https://www.internalfb.com/intern/testinfra/testrun/11540474087422107
**281475054644853**: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_quant_pin_memory (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'
https://www.internalfb.com/intern/testinfra/testrun/11540474087422477
**844425008078016**: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_cuda_quantization_does_not_pin_memory (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'
https://www.internalfb.com/intern/testinfra/testrun/1407375259845199
Differential Revision: D60055277
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133158
Approved by: https://github.com/jovianjaison
Summary: Recently we observed in AI CMF that enabling the decompose_mm pass leads to mixed-dtype errors for aten.mm and aten.addmm. By investigation, we figured out that the error comes from torch.sum, which has an implicit type cast to avoid possible overflow (a similar discussion on GitHub: https://github.com/pytorch/pytorch/issues/115832). Thus we cast the output to avoid the error.
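As a simplified illustration (not the pass's actual code, and the model's dtypes differ), torch.sum's implicit dtype promotion is easiest to see with integer inputs; the fix casts the sum's output back to the expected dtype in the same spirit:
```python
import torch

x = torch.ones(4, dtype=torch.int16)

summed = x.sum()
print(summed.dtype)  # torch.int64 -- implicit promotion to avoid overflow

# Cast the decomposition's output back so downstream ops see a single dtype.
summed_cast = summed.to(x.dtype)
print(summed_cast.dtype)  # torch.int16
```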
Test Plan:
# unit test
```
buck2 test mode/dev-nosan //caffe2/test/inductor:decompose_mem_bound_mm -- test_decompose_mm_mixed_precision
```
Buck UI: https://www.internalfb.com/buck2/00dc168e-4d65-40f8-b169-f4a58206f641
Test UI: https://www.internalfb.com/intern/testinfra/testrun/17169973624867151
Network: Up: 25KiB Down: 44KiB (reSessionID-b7e2ecc7-16ca-476d-95b2-09ea74645eb0)
Jobs completed: 19. Time elapsed: 1:07.6s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
# e2e
ads_dper3:68464f2dc5e849ba2670482079cecaaa
training_platform:2c41d916ad5dd82f196372a8c7bd37a0
### build training_platform
```
buck2 run fbcode//fblearner/flow/projects/training_platform:training_platform
```
### register training_platform
```
buck2 run mode/opt fblearner/flow/projects/training_platform:workflow -- register-workflows --project-name training_platform --flow_version training_platform:2c41d916ad5dd82f196372a8c7bd37a0
```
### build ads_dper 3
```
fbpkg build -E ads_dper3 --yes --expire 14d
```
### register ads_dper 3
```
buck2 run //pyper/core/eval_app_utils:flow_utils_script -- register --pkg-version ads_dper3:68464f2dc5e849ba2670482079cecaaa
```
### extend package (optional)
```
fbpkg expire --extend-only training_platform:2c41d916ad5dd82f196372a8c7bd37a0 30d
```
### before fix
f591360990
### after fix
baseline
f591395056
proposal
Differential Revision: D61351815
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133626
Approved by: https://github.com/jackiexu1992
If the scalar tensor is an output tensor, it shouldn't be unwrapped (i.e. `.item()` called) since `tl.store` requires a pointer type for outputs. This issue only occurs for mutated buffers: the input tensor is also used as an output tensor.
Fixes #ISSUE_NUMBER
@yanboliang @jansel @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132859
Approved by: https://github.com/jansel
`torch.cuda.Event` objects are different from `torch.cuda.Stream` in that events are not pooled, meaning we can't look up a previously created CUDA event object by ID. This prevents CUDA event object created outside of the Dynamo graph from being used within the graph (since Dynamo needs a way to emit a `call_function` line in the graph that does the retrieval of the event object for downstream op use). This PR adds a simple object pool within Dynamo utility, to support looking up CUDA event object by ID from within the Dynamo graph.
After this PR, if a user creates a CUDA event object outside of the graph and use that event within the graph, the behavior will exactly match eager.
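A minimal sketch of such an ID-keyed pool (hypothetical helper names, not Dynamo's actual utility):
```python
import torch

_cuda_event_pool = {}  # maps id(event) -> torch.cuda.Event

def register_cuda_event(event):
    key = id(event)
    _cuda_event_pool[key] = event
    return key

def get_cuda_event(key):
    # Dynamo can emit a call_function node targeting a getter like this one
    # so the graph retrieves the user-created event object at runtime.
    return _cuda_event_pool[key]

if torch.cuda.is_available():
    ev = torch.cuda.Event()
    key = register_cuda_event(ev)
    assert get_cuda_event(key) is ev
```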
Test commands:
- `pytest -rA test/dynamo/test_ctx_manager.py::CtxManagerTests::test_cuda_event_created_outside_of_graph`
- `pytest -rA test/dynamo/test_ctx_manager.py::CtxManagerTests::test_cuda_event_across_graph_break`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133635
Approved by: https://github.com/yifuwang
ghstack dependencies: #133532, #133531, #133636
During Inductor lowering, an op's layout constraints are applied before the op's lowering is called. Currently `add_layout_constraint(aten._scaled_mm.default, constrain_to_fx_strides)` is called inside `aten._scaled_mm.default`'s lowering. This means that if the first `_scaled_mm` to be lowered relies on the layout constraint, the constraint won't be applied and the generated code will fail. The issue won't manifest if the first `_scaled_mm` doesn't rely on the layout constraint.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133669
Approved by: https://github.com/drisspg, https://github.com/yangsiyu007
Updates the CUDNN_frontend header-only library to make the most of the newest CUDNN features and decrease the overhead of the library.
Copied from commit:
New API
- Graph Slice Operation: Introduced the graph.slice operation for slicing input tensors. Refer to docs/operations/Slice.md for detailed documentation and samples/cpp/misc/slice.cpp for a C++ sample. Pybinds for this operation have also been added.
- SM Carveout Feature: Added the set_sm_count(int32_t type) graph property to support the SM Carveout feature introduced in Ampere and Hopper GPUs. Engines that do not support SM_COUNT will return NOT_SUPPORTED.
Bug Fixes
- Convolution Mode Attribute: Added the missing set_convolution_mode attribute to convolution attributes in forward propagation (fprop), data gradient (dgrad), and weight gradient (wgrad). Previously, this was hardcoded to CUDNN_CROSS_CORRELATION in the 1.x API.
- SDPA FP8 Backward Node: Fixed an issue with the deserialization of the sdpa_fp8_backward node.
Enhancements
- Graph Execution Overhead: Reduced the overhead of graph.execute() by optimizing sub-node tree traversal, collected UIDs, workspace modifications, and workspace size.
- Graph Validation Performance: Significantly improved (~10x) the performance of graph.validate() by deferring graph expansion to a later stage (build_operation_graph).
- Optional Running Stats for BatchNorm: Made the running statistics for the batch normalization operation optional, supported by cuDNN backend version 9.3.0 and later.
- Shape and Stride Inferencing: Enhanced shape and stride inferencing to preserve the stride order of the input.
- Diagnostic Error Message: Added a diagnostic error message to create_execution_plans if called without the preceding build_operation_graph.
- JSON Schema and Deserialization: Improved the JSON schema and deserialization logic with additional checks.
- Logging Overhead: Reduced logging overhead, resulting in faster graph.build() calls.
- CMake Integration: Replaced CMAKE_SOURCE_DIR with PROJECT_SOURCE_DIR in CMake files for better integration. See the relevant pull request for more details.
Samples
- Jupyter Notebooks: Added Jupyter notebooks for RMSNorm, InstanceNorm, and LayerNorm. Refer to the samples/python folder for more information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133687
Approved by: https://github.com/eqy, https://github.com/malfet
During distributed training if all ranks except one hit the cache, the rank that did not hit the cache will cause a NCCL timeout since rest of the ranks will enter the collective and start the timer. This PR uses the new PTD API to increase timeout for the ranks that hit the cache by the amount of time the cache would save.
Differential Revision: [D61363722](https://our.internmc.facebook.com/intern/diff/D61363722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133374
Approved by: https://github.com/ezyang
This is a low-risk short-term fix for
https://github.com/pytorch/pytorch/issues/128084, for the purposes of
2.4.1. The actual fix for that issue is more risky and we'll target 2.5.
needs_fixed_stride_order is silently incorrect with args that are
mutable because it creates clones of those args, writes into them, and
doesn't update the original args.
This PR makes it so that needs_fixed_stride_order doesn't apply to
inputs that are being mutated.
This PR doesn't completely fix the problem, but it makes it less
incorrect: most of the time the input already has the correct strides
but inductor fails to recognize it, and in those cases writing directly
to the input is fine.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133452
Approved by: https://github.com/eellison
Fix https://github.com/pytorch/pytorch/issues/132716
The Triton template for convolution does not work when the stride or padding contains dynamic shapes. Use the hint and add guards to handle that. An alternative is to fall back to eager, but since I've seen the lowering rule for convolution use the hint in other cases, I'll just follow the convention.
I don't really know how to add a unit test here since I need to create symbolic strides (not strides of a tensor but the stride parameter for convolution) and paddings. I can try harder if reviewers want me to add unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132938
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #132952
Setting `torch._dynamo.config.skip_fsdp_hooks = True` is required for graph-break compiled FSDP2, thus setting it to default will make this adoption easier. If users want to use Traceable FSDP2, they can set this to False manually (which will allow FSDP2 hooks to be traced through).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133531
Approved by: https://github.com/awgu
ghstack dependencies: #133532
Fixes #128059
I'm not sure if this is the right way, since Inductor doesn't always respect the device id set by users, so probably we should just wrap it as null context manager and print a warning. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @jansel @anijain2305 @mlazos @williamwen42
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133385
Approved by: https://github.com/jansel
Summary:
We saw ncclCommAbort being called and hanging during NCCLComm::create.
If the NCCL comm is not properly initialized, ncclCommAbort behavior is
'undefined'; avoiding the call allows the process to properly throw an
exception
Test Plan:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133630
Approved by: https://github.com/wconstab
This PR fixes the accuracy issues when template_buffer has users other than the epilogue nodes. This will fix the accuracy failure of the below models using max-autotune:
- MobileBertForMaskedLM
- MobileBertForQuestionAnswering
- convnext_base
- swin_base_patch4_window7_224
## Issue 1:
Previously we always added `template_buffer` as an alias of `Y`. When the `template_buffer` has users other than the epilogue nodes, we shouldn't set it as an alias of `Y`. This PR adds a check for that case.
Wrong code before the fix where `tmp4` and `tmp9` are both stored to `Y` while we need 2 different buffers for them since `tmp4` will be used by nodes other than the epilogue node:
```cpp
Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp4; // tmp4 is the output of the template
Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp9; // tmp9 is the output of the epilogue node
```
Correct code after the fix:
```cpp
out_ptr2[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp4;
Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp9;
```
## Issue 2:
When fixing the above issue, we found that there's correctness issue when `bias` is `False`. The root cause is that in the case where `bias` is `False`, the `template_buffer` has users other than the epilogue nodes and the GEMM output buffer is localized, we need to add an extra copy epilogue to ensure that the GEMM output (a local buffer) is stored to the `template_buffer` that will be used later by other nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133073
Approved by: https://github.com/jgong5
ghstack dependencies: #133070
Summary: Some symbols (unbacked symints?) can have an upper bound of `sys.maxsize - 1`, but our code for runtime assertions assumes that such upper bounds come in as `sympy.oo` (like backed symints?) in order to drop them. So we weren't dropping them, which this PR fixes.
Test Plan: added test
Differential Revision: D61352056
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133627
Approved by: https://github.com/SherlockNoMad
Updating the source matcher to also accept pattern matching on the torch_fn metadata, which exists in both strict and non-strict export. We want to replace the use of source_fn_stack with torch_fn, as it's not possible for us to get source_fn_stack in non-strict export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133642
Approved by: https://github.com/ydwu4
This PR enables dynamic shapes for the CK backend for gemm max autotune (see #125453).
This is achieved via unhardcoding the problem sizes from the template body and passing them as parameters instead.
We handle passing the problem sizes for the kernel call as well as for the benchmark call.
# Testing
`pytest test/inductor/test_ck_backend.py [-k dynamic]`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133285
Approved by: https://github.com/ColinPeppler
Summary: Recently we observed more missing example values in nodes introduced by Optimus, which causes problems for further optimizations that need this node info. Thus we add the meta for these nodes in this diff.
Test Plan:
# unit test
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```
Buck UI: https://www.internalfb.com/buck2/c0ad506f-ce9d-4b80-947a-cb79074b72f0
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2251800058834808
Network: Up: 1.4GiB Down: 2.0GiB (reSessionID-fb781425-f29b-44b5-8a5b-daffe7274f86)
Jobs completed: 300289. Time elapsed: 13:19.5s.
Cache hits: 99%. Commands: 119360 (cached: 118494, remote: 824, local: 42)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0
# benchmark
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf_shrink" --flow_id 587303213
```
P1520691492
Differential Revision: D61039772
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133414
Approved by: https://github.com/jackiexu1992
This is the first step toward having a basically functional analyzer for Flight Recorder (FR) in production.
- We want to use this script to find out abnormalities in collectives and report it to users.
- We also fixed some type errors.
- [Ongoing] Also we will add more unit tests to this script and make it modularized so that we can better maintain it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133412
Approved by: https://github.com/c-p-i-o
Summary: Switch to set_proxy_slot instead of setting the proxy directly on the Tensor. We do not want to add Proxy to tensor objects, because Proxy cannot be deepcopied or pickled and can cause problems when users want to deepcopy or pickle models.
Test Plan: CI
Differential Revision: D61277650
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133470
Approved by: https://github.com/zou3519
This PR adds support in train_decision for learning a ranking heuristic. The main idea is that the user has to provide the number of choices the heuristic should return. I added a way to prune the learned decision tree such that it always returns the number of choices provided by the user.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131705
Approved by: https://github.com/eellison
Improve the cache blocking by reducing Mc_blocks to make A reside in L2 and be reused by B as much as possible. This improves large-batch-size perf for both scenarios: 1) N is large and K is of medium size; 2) K is large. Different strategies are used to handle these scenarios. Check the notes in `get_cache_blocking` in the changes.
Measured with 56-core Intel (R) Xeon (R) CPU Max 9480, jemalloc 5.1 and intel omp, bf16. Run with code cache of B matrix (weights).
Model Shapes | Before Optimization | After Optimization | Speedup | onednn linear | Speedup over onednn
-- | -- | -- | -- | -- | --
M=1024, N=12288, K=4096 (Llama2-8b) | 5.69 ms | 3.71 ms | 1.53 | 4.53 ms | 1.22
M=1024, N=4096, K=4096 (Llama2-8b) | 1.69 ms | 1.63 ms | 1.04 | 2.05 ms | 1.26
M=1024, N=22016, K=4096 (Llama2-8b) | 10.32 ms | 6.57 ms | 1.57 | 8.46 ms | 1.29
M=1024, N=4096, K=11008 (Llama2-8b) | 5.21 ms | 3.26 ms | 1.60 | 4.65 ms | 1.43
M=1024, N=5120, K=4096 (Llama3-8b) | 1.99 ms | 1.78 ms | 1.12 | 2.31 ms | 1.30
M=1024, N=28672, K=4096 (Llama3-8b) | 13.41 ms | 8.56 ms | 1.57 | 10.96 ms | 1.28
M=1024, N=4096, K=14336 (Llama3-8b) | 6.93 ms | 4.31 ms | 1.61 | 6.24 ms | 1.45
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132729
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/jansel
Summary:
Fix quantization pass to be compatible with the new export IR.
Some nodes might have side-effects, so they don't have users, but still are not removed by the DCE pass.
Test Plan:
CI
buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/export:export_rle_model -- -r export_rle_model
Differential Revision: D61223356
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133587
Approved by: https://github.com/tugsbayasgalan
Summary: With acc_tracer disabled, the generated nodes use `args` instead of `kwargs` as before. The current passes mix usage of `args` and `kwargs`, and normalizing nodes to switch between them can cause subsequent passes to work or break. In this diff we create a pass that normalizes all nodes to use `kwargs` at the beginning and change all the passes to follow the same convention.
Reviewed By: frank-wei
Differential Revision: D61049898
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133518
Approved by: https://github.com/frank-wei
Some recommendation models have a high number of `nn.Parameter`s. This exacerbates per-tensor CPU overheads in FSDP2 compared to FSDP1.
This PR adds a fast path for the common bf16/fp32 mixed-precision case when casting the parameters from fp32 to bf16, to reduce CPU overhead and possibly get a more efficient copy.
- Old: `for` loop + `.to(torch.bfloat16)`, incurring dispatcher overhead per parameter
- New: `torch.empty` + `torch.split` + `torch._foreach_copy_`, incurring three dispatches
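A minimal sketch of the difference, using a plain Python list of fp32 tensors as a stand-in for the parameter group (not FSDP2's actual code):
```python
import torch

params_fp32 = [torch.randn(4, 4), torch.randn(8)]

# Old: one dispatch per parameter.
casted_old = [p.to(torch.bfloat16) for p in params_fp32]

# New: one flat bf16 buffer, split into views, then one fused copy --
# roughly three dispatches regardless of the number of parameters.
numels = [p.numel() for p in params_fp32]
flat = torch.empty(sum(numels), dtype=torch.bfloat16)
views = list(torch.split(flat, numels))
torch._foreach_copy_(views, [p.reshape(-1) for p in params_fp32])
casted_new = [v.view(p.shape) for v, p in zip(views, params_fp32)]

for old, new in zip(casted_old, casted_new):
    torch.testing.assert_close(old, new)
```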
---
Example on Llama3-8B which does not have many `nn.Parameter`s (compared to recommendation models):
(Old) on Llama3-8B (0.46 ms CPU overhead for all-gather):

(New) on Llama3-8B (0.37 ms CPU overhead for all-gather):

---
Same example as above but now with float8 all-gather:
(Old) on Llama3-8B with float8 (0.996 ms CPU overhead for all-gather):

(New) on Llama3-8B with float8 (1.014 ms CPU overhead for all-gather):

The times are relatively comparable for float8 with the new one possibly slightly slower, but this is mainly because for Llama's transformer blocks, there are only two norm weights that need to cast to bf16. These screenshots are mainly to show that the optimization still works in the mixed case.
Differential Revision: [D61236983](https://our.internmc.facebook.com/intern/diff/D61236983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133369
Approved by: https://github.com/weifengpy
ghstack dependencies: #133498
Summary: Some elements of a tensor list output do not have a user. In such cases, create a name of the form `{node_name}_unused_{index}` for them.
Test Plan: OSS CI
Differential Revision: D61309011
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133539
Approved by: https://github.com/zhxchen17
These tests keep failing on the Linux Amazon 2023 AMI. The distributed team is looking into them, but until then, disabling the tests in order to unblock the AMI upgrade
Examples of the failures:
Failure 1: https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963175
```
FAILED [90.0880s] distributed/test_c10d_nccl.py::NCCLTraceTestDumpOnTimeout::test_timeout_dumps_timing_enabled_False - AssertionError: None mismatch: None is not -6
```
Failure 2: https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963494
```
____ NCCLTraceTestTimeoutDumpOnStuckRanks.test_timeout_dumps_on_stuck_ranks ____
Traceback (most recent call last):
File "/var/lib/jenkins/workspace/test/distributed/test_c10d_nccl.py", line 4214, in test_timeout_dumps_on_stuck_ranks
self.assertEqual(self._wait_process(0, timeout=90), -6)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3721, in assertEqual
raise error_metas.pop()[0].to_error(
AssertionError: None mismatch: None is not -6
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133355
Approved by: https://github.com/kit1980, https://github.com/wconstab
It is possible to write to Meta's internal in-memory database Scuba via the Scribe Graph API: https://www.internalfb.com/intern/wiki/Scribe/users/Knowledge_Base/Interacting_with_Scribe_categories/Graph_API/ This is currently being used by pytorch/benchmark repo to upload torchbench performance results.
I want to make this API generally available to all jobs running on CI in a semi-trusted context. To talk to Scribe, you need a secret access token. I have initially configured an environment prod-branch-main which contains `SCRIBE_GRAPHQL_ACCESS_TOKEN`, and switched a single class of jobs (linux-test) to use this environment when they are running on the main branch. Because we require approvals for running CI on untrusted contributions, we could potentially allow all jobs to run in this environment, including jobs on PRs, but I don't need this for my use case (per-PR benchmark result reporting, and miscellaneous statistics on main.)
If this works, I'll push out this environment to the rest of our test jobs.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133536
Approved by: https://github.com/xuzhao9, https://github.com/malfet, https://github.com/albanD
Summary:
Logging C++ stack traces occasionally races with shutdown processes on exception. It isn't safe and we've seen SIGSEGVs in the field.
These crashes prevent flight recorder dumps from completing.
For now, default this dumping to `true` and provide a knob if we need to control things in production.
Test Plan:
Tested locally on a job named `torchx-chirag_test_run` to make sure that the JK was honored by the code.
It was correctly disabled on my test job.
see (TORCH_NCCL_LOG_CPP_STACK_ON_EXCEPTION: 0) below.
```
] [trainer2]:I0814 11:21:20.152419 3708 ProcessGroupNCCL.cpp:874] [PG ID 0PG GUID 0 Rank 10] ProcessGroupNCCL environments: NCCL version: 2.20.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 0, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 480, TORCH_NCCL_TRACE_BUFFER_SIZE: 2000, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0, TORCH_NCCL_LOG_CPP_STACK_ON_EXCEPTION: 0
```
Differential Revision: D61283335
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133490
Approved by: https://github.com/fduwjj
Summary:
It seems we have multiple places deserializing torchbind objects. Moving the code around so that every load essentially shares the same implementation.
Also added a test case "package_reader_testing" which loads back the archive file in Python and eagerly validates the numerical result.
Test Plan: buck test mode/opt sigmoid/inference/test:e2e_test_cpu
Reviewed By: SherlockNoMad
Differential Revision: D61235770
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133463
Approved by: https://github.com/ydwu4
Fixes #124550
Also moves `graph.eliminate_dead_code()` call to a few lines after
`_inline_module(...)` in `const_fold.py`
* Test plan:
Add a new test on `test_eager_transforms.py` to ensure the reported
issue was indeed fixed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133364
Approved by: https://github.com/zou3519
This PR adds the foreach impl for Adafactor knowing that there are many ways to improve its runtime perf today (by adding more foreach support). After this PR:
- we have a foreach flag for Adafactor
- It is NOT the default. Why not? It is only slightly faster + uses O(n) more memory where n is the number of params in your max param group. People tend to use Adafactor for memory efficiency.
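Hypothetical usage of the flag described above (assuming the optimizer is exposed as `torch.optim.Adafactor` and accepts a `foreach` keyword, as this PR describes):
```python
import torch

model = torch.nn.Linear(16, 4)
opt = torch.optim.Adafactor(model.parameters(), foreach=True)  # opt-in, not the default

loss = model(torch.randn(2, 16)).sum()
loss.backward()
opt.step()
opt.zero_grad()
```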
Next steps:
- make torch.compile possible on it
- make it faster (by adding more foreach apis)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132336
Approved by: https://github.com/albanD
ghstack dependencies: #133360
**What does this PR achieve**
1. This PR rewrites the ring attention backward algorithm to fuse the all-to-all and overlap the gradient communication with computation.
2. Enables memory-efficient attention with CP by templating the ring attention backward; verifying the accuracy in fp32 gives us higher confidence in the implementation's correctness.
3. Provides some experimental APIs to enable context parallelism.
4. Ensures CP works with torch.compile. The combination of causal masking and torch.compile has not
yet worked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131351
Approved by: https://github.com/wanchaol
Sorryyyyy for another refactor. This splits `_process_dynamic_shapes` into 3 parts:
1. `_combine_args` - mostly the same thing
2. `_check_dynamic_shapes`, which is responsible for raising 99% of UserErrors if the dynamic shapes spec is invalid (minus 1 UserError with DerivedDims)
3. `_process_dynamic_shapes`, which for now, is the same thing, minus the stuff in 2.
This refactor is helpful for incoming automatic dynamic shapes work, because, we're switching to `assume_static_by_default=False`, which is what `_dynamo.export` currently does. This means any unspecified dims are allocated a symbol, in contrast to export today which keeps unspecified dims static. Historically this has been desirable - export users don't want too much dynamism. So we want to change how the spec is translated into constraints.
This means when we switch over to automatic dynamic shapes, we want to plug in something in between steps 2. and 3. which patches up the spec for `assume_static_by_default=False`, filling in static shapes for any unspecified dims, and potentially clearing out the auto-dynamic dims (since they're no-ops). We would do this in-between 2. and 3. to keep `_process_dynamic_shapes` semantically the same, since it's used with `_dynamo.export`.
We could do this without a refactor, plugging in this transform before `_process_dynamic_shapes`, but since that function's responsible for both spec checking + constraint production, moving spec checking to before we transform the specs helps guarantee we're raising errors on what the user's specified, and not an internal export bug.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133391
Approved by: https://github.com/avikchaudhuri
Computing `elapsed_time` immediately after `start_time` does not reflect the real execution time of `test_batch`.
Move the `elapsed_time` computation and the print call to after the `run_tests` call to fix it.
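A minimal sketch of the corrected ordering (`run_tests` here is just a stand-in for the real call):
```python
import time

def run_tests():
    time.sleep(0.1)  # stand-in for actually running the test batch

start_time = time.time()
run_tests()
# Only now does the delta cover the work we want to measure.
elapsed_time = time.time() - start_time
print(f"test_batch took {elapsed_time:.2f}s")
```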
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133199
Approved by: https://github.com/clee2000
This PR introduces scripts that make it easier to use autoheuristic:
- `collect_data.sh`: The user can specify things like the number of GPUs to be used and the number of training samples to collect. This script will open one tmux pane per GPU and collect num_training_samples/num_gpus samples per GPU.
- `merge_data.py`: This script can be used to merge multiple training data files into a single file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133409
Approved by: https://github.com/Chillee
The function argument is A, not V.
Remaining inconsistency is the matrix $A$ with columns $v_i$.
It seems, a better solution would be to rename the argument $A \rightarrow V$, but this might lead to backward compatibility issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124279
Approved by: https://github.com/lezcano
When I did profiling using the "TORCHINDUCTOR_PROFILE" option, some kernels showed less bandwidth than expected. So I added an option to exclude the CPU overheads from the profiled time:
```
# With the option:
(pytorch-3.10) [shuqiyangdevgpu001.lla3 ~/local/pytorch (gh/shunting314/144/head)]$ TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_WITH_DO_BENCH_USING_PROFILING=1 TORCHINDUCTOR_PROFILE_OUTPUT=/tmp/profile.txt python ../test_pt/a.py
0.038ms 0.067 GB 1777.11GB/s triton_poi_fused__to_copy_clamp_clone_mul_0
SUMMARY (/tmp/torchinductor_shuqiyang/tmp03wdg8e4/m6/cm6vdqp62ofwsone3u3fmb42vs3fti5omseo3qn4ddh2bhalsvbn.py)
0.04ms 0.07 GB 1777.11GB/s
# Without the option:
(pytorch-3.10) [shuqiyangdevgpu001.lla3 ~/local/pytorch (gh/shunting314/144/head)]$ TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT=/tmp/profile.txt python ../test_pt/a.py
0.040ms 0.067 GB 1663.09GB/s triton_poi_fused__to_copy_clamp_clone_mul_0
SUMMARY (/tmp/torchinductor_shuqiyang/tmpwr6rraao/s4/cs4npkh77myatwpcmsizyduyfm6ne6o4pg4n3eodejdvvg2j3xzd.py)
0.04ms 0.07 GB 1663.09GB/s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133523
Approved by: https://github.com/nmacchioni
The functorch partitioners use network flow to split the joint graph into a forward and backward graph. Internally, we've found that upgrading to networkx 2.8.8 (from 2.5) results in some hard-to-debug failures (internal reference: https://fburl.com/workplace/jrqwagdm). And I'm told that there's interest to remove the python dependency.
So this PR introduces a C++ implementation that mirrors the API provided by networkx. We'll need to add python bindings and do some additional testing to verify correctness.
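For context, a tiny example of the kind of min-cut computation networkx provides (illustrative only; the partitioner's real joint graph and capacities are very different):
```python
import networkx as nx

G = nx.DiGraph()
G.add_edge("source", "a", capacity=3.0)
G.add_edge("source", "b", capacity=1.0)
G.add_edge("a", "sink", capacity=2.0)
G.add_edge("b", "sink", capacity=3.0)

cut_value, (reachable, non_reachable) = nx.minimum_cut(G, "source", "sink")
print(cut_value)          # 3.0
print(sorted(reachable))  # nodes staying on the "source" side of the cut
```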
Differential Revision: [D61284135](https://our.internmc.facebook.com/intern/diff/D61284135)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132188
Approved by: https://github.com/Chillee
**Summary**
When checking the vectorization status across the 3 test suites, we found some operators disabled vectorization with the message `Disabled vectorization: op: remainder`. In this PR, we add vectorization support for this op.
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec
```
Differential Revision: [D61147014](https://our.internmc.facebook.com/intern/diff/D61147014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849
Approved by: https://github.com/jgong5, https://github.com/lezcano
During distributed training if all ranks except one hit the cache, the rank that did not hit the cache will cause a NCCL timeout since rest of the ranks will enter the collective and start the timer. This PR uses the new PTD API to increase timeout for the ranks that hit the cache by the amount of time the cache would save.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133374
Approved by: https://github.com/ezyang
ghstack dependencies: #133362, #133363
For the next phase of the Amazon 2023 migration we'll be bulk migrating the remaining jobs over to the new AMI by changing the default AMI that we use.
In preparation for that, we're adding the old Linux Amazon 2 AMI as a fixed variant for runners, so that if any of the less frequently run jobs breaks on the Amazon 2023 AMI, it can shift to explicitly using the Amazon 2 AMI temporarily while the underlying problem is debugged and fixed.
This PR is part 1, and there's a corresponding scale config PR in test-infra: https://github.com/pytorch/test-infra/pull/5551
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133469
Approved by: https://github.com/clee2000
We realized the fix in https://github.com/pytorch/pytorch/pull/129683 for loading the learning rate in place actually broke meta tensor initialization. After PR #129683, the learning rate loads correctly, but the params that are meta tensors are still uninitialized.
We cannot use `tree_map_only_` to iterate over the state_dict for in-place initialization, as `empty_like` and `to("cuda")` are both not in-place options. More context in https://github.com/pytorch/pytorch/issues/130709. Therefore, with the changes in https://github.com/pytorch/pytorch/pull/129683, the tensors after loading are still meta tensors. We previously did not catch that since `self.assertEqual()` does not distinguish a DTensor from a meta DTensor.
In this PR, we added an _iterate_state_dict() function to implement in-place updates for the state_dict and updated the test to make sure that the params are no longer meta tensors after loading.
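A simplified sketch of the idea (plain tensors only, no DTensor; not the actual `_iterate_state_dict()` implementation): since `empty_like` / `.to()` cannot update a tensor in place, the traversal reassigns into the container instead.
```python
import torch

def materialize_meta_state_dict(state_dict, device="cpu"):
    for key, value in state_dict.items():
        if isinstance(value, dict):
            materialize_meta_state_dict(value, device)
        elif isinstance(value, torch.Tensor) and value.is_meta:
            # Replace the meta tensor with a real (uninitialized) tensor.
            state_dict[key] = torch.empty_like(value, device=device)
    return state_dict

sd = {"weight": torch.empty(2, 2, device="meta"), "step": torch.tensor(1)}
materialize_meta_state_dict(sd)
print(sd["weight"].is_meta)  # False
```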
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133256
Approved by: https://github.com/fegin
Before, having arbitrary depth nested configs like
```
class Foo:
    foo: List[int] = [1, 2, 3]

    class Bar:
        bar: str = "1"

        class Baz:
            baz: int = 1
```
would cause problems beyond the first layer. For example, if we tried
```
from torch._inductor import config as inductor_config
print(inductor_config.Foo)
print(repr(inductor_config.Foo.foo))
print(inductor_config.Foo.Bar)
print(repr(inductor_config.Foo.Bar.bar))
print(inductor_config.Foo.Bar.Baz)
print(repr(inductor_config.Foo.Bar.Baz.baz))
```
we would get some output like
```
<torch.utils._config_module.SubConfigProxy object at 0x7fac65de00a0>
[1, 2, 3]
...
AttributeError: torch._inductor.config.Foo.Bar does not exist
```
Obviously, this is not what we want. With these changes, we get the right values
```
<torch.utils._config_module.SubConfigProxy object at 0x7f840d05bf40>
[1, 2, 3]
<torch.utils._config_module.SubConfigProxy object at 0x7f840cedc940>
'1'
<torch.utils._config_module.SubConfigProxy object at 0x7f840cedc100>
1
```
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133418
Approved by: https://github.com/oulgen
Fixes segmentation fault during model load via C++ API.
An `Assign` statement (`TK_ASSIGN` type) has 3 fields: `lhs`, `rhs` and `type`. Field `type` is of type `Maybe`, which means it may not be present. During model load in `import_source.cpp` the `type` field is dereferenced without validation.
It is a similar error to the one fixed in #106041.
Fixes #127877
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127878
Approved by: https://github.com/malfet
This PR fixes the accuracy of jx_nest_base and part of the accuracy issue of convnext_base of the max-autotune path. Another fix (https://github.com/pytorch/pytorch/pull/133073 in this ghstack) is needed to make convnext_base fully pass the accuracy check.
The index calculated via the reindexer was wrong before this PR. Both the shape of the reshape reindexer and the stride order of the stride reindexer needs to be fixed.
Index calculated before this PR:
```
# in_ptr4 points to arg4_1: size = (1, 32, 18, 18), stride = (10368, 1, 576, 32))
auto tmp7 = in_ptr4[static_cast<long>((32L*(static_cast<long>((n_start + x1 + (32L*m_start) + (32L*x0))) % static_cast<long>(18L))) + (576L*(static_cast<long>(c10::div_floor_integer((n_start + x1 + (32L*m_start) + (32L*x0)), 324L)) % static_cast<long>(32L))))];
```
The correct one after the fix is:
```
auto tmp7 = in_ptr4[static_cast<long>(n_start + x1 + (32L*(static_cast<long>((m_start + x0)) % static_cast<long>(324L))))];
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133070
Approved by: https://github.com/jgong5
Summary:
Fixes https://github.com/pytorch/pytorch/issues/133336
When we fail to suggest fixes for a data dependent error because some symbols couldn't be mapped to sources, we print out those symbols but there was a silly bug in the printing code.
New error:
```
...
raise self._make_data_dependent_error(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(u0 + 1, CeilToInt(IntTrueDiv(u0 + 1, 1))) (unhinted: Eq(u0 + 1, CeilToInt(IntTrueDiv(u0 + 1, 1)))). (Size-like symbols: u0)
Potential framework code culprit (scroll up for full backtrace):
File "/data/users/avik/fbsource/buck-out/v2/gen/fbcode/6ef5f323b6193f0f/pyspeech/fb/tools/__export_speech_llama__/export_speech_llama#link-tree/torch/_refs/__init__.py", line 2972, in expand
guard_size_oblivious(requested_length == x)
For more information, run with TORCH_LOGS="dynamic"
For extended logs when we create symbols, also add TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="u0"
If you suspect the guard was triggered from C++, add TORCHDYNAMO_EXTENDED_DEBUG_CPP=1
For more debugging help, see https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit?usp=sharing
For C++ stack trace, run with TORCHDYNAMO_EXTENDED_DEBUG_CPP=1
The following call raised this error:
File "/data/users/avik/fbsource/buck-out/v2/gen/fbcode/6ef5f323b6193f0f/pyspeech/fb/tools/__export_speech_llama__/export_speech_llama#link-tree/pyspeech/nn/utils.py", line 271, in lengths_to_padding_mask
).expand(batch_size, max_length)
```
Test Plan: Repro gets past reported error, hits new error
Differential Revision: D61221994
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133345
Approved by: https://github.com/ezyang
Original issue:
https://github.com/pytorch/pytorch/issues/129486
Previously, subclass_wrapper() got inputs containing additional effect tokens and failed because this did not match the SubclassMeta indexes.
This happened because functionalization was responsible for adding / removing those tokens.
Functionalization cannot be run above Subclasses, as args/outputs are duplicated in case of mutations.
The main design idea is for the EffectTokens, Subclasses, and Functionalization logic to know as little as possible about each other's transformations.
To that end, the EffectTokens manipulation is extracted into a separate wrapper, which is processed above SubclassWrapper, while functionalization happens below SubclassWrapper as before.
In that case subclass wrap/unwrap works without any knowledge of the additional arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131672
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
Summary: Model owners can set the lower_settings with max_acc_splits=2, and lowering will fail during model iteration, to alert them of possible performance degradation from increased fragmentation.
Test Plan: Added unit tests
Differential Revision: D60133589
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133041
Approved by: https://github.com/hl475
## Summary
As part of #125683, this PR modifies existing CPU GEMM cpp template & micro-kernel template to enable int8 WoQ GEMM auto-tuning with AVX2, AVX512 & AMX ISAs (the latter is only available on Xeon 4th generation & beyond).
WoQ GEMM takes FP16/BF16 activations, int8 weights, and scale of the same dtype as activations.
The operation is equivalent to `torch.nn.functional.linear(x, w.to(x.dtype)) * scale`, which is essentially what the ATen op `torch.ops.aten._weight_int8pack_mm` currently does (except that weights are not cached by it). Weights will be considered constant & cached, so this implementation is suitable for inference, and not QAT. `scale` is supported as a `mul` epilogue.
Only BF16 activations have been supported in this PR because for FP16 & FP32, weight is dequantized during constant-folding pass of freezing, and then after auto-tuning, performance with a large `M` dimension may be better than either torch.ops.aten._weight_int8pack_mm, or the WoQ micro-kernel support introduced in this PR, which dequantizes `w` within the micro-kernel.
While even BF16 activations with a large `M` dimension may benefit from dequantizing `w` beforehand, for now, they would use WoQ support in GEMM templates for auto-tuning, and then a subsequent PR would add logic for deciding whether or not to dequantize weights beforehand.
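A rough numerical sketch of the stated equivalence (assuming `torch.ops.aten._weight_int8pack_mm` is available in your build; the comparison is approximate because of bf16 rounding and accumulation-order differences):
```python
import torch

m, k, n = 4, 64, 8
x = torch.randn(m, k, dtype=torch.bfloat16)
w = torch.randint(-128, 127, (n, k), dtype=torch.int8)
scale = torch.rand(n, dtype=torch.bfloat16) + 0.5

ref = torch.nn.functional.linear(x, w.to(x.dtype)) * scale
out = torch.ops.aten._weight_int8pack_mm(x, w, scale)
print((out - ref).abs().max())  # expected to be small
```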
### Performance
#### AMX
Op-level speedup due to AMX micro-kernel (selected during auto-tuning) on 32 physical cores of Intel(R) Xeon(R) Platinum 8468H (of Xeon 4th generation series, codenamed Sapphire Rapids) vs. ATen kernel `torch.ops.aten._weight_int8pack_mm`. Intel OpenMP & tcmalloc were preloaded.
In a few cases with an odd `K`, the implementation being added in this PR may not perform as well as the ATen kernel, which is unrelated to this PR, though, since `test_linear_amx` also exhibits similar datapoints. In those cases, the AMX micro-kernel might be slower than AVX512 micro-kernel, so if such sets of shapes are used for auto-tuning, either the AVX512 micro-kernel implementation, or the ATen kernel would be chosen instead.
Benchmarked with unit-tests.
Tabular data at https://gist.github.com/sanchitintel/294811a86c8ff6b867c668ae2107c405?permalink_comment_id=5142442#gistcomment-5142442
The AVX512 micro-kernel was disabled to collect data for AMX micro-kernel.
#### AVX2/AVX512 micro-kernels
Tabular data at at https://gist.github.com/sanchitintel/52b5fa9c66f791be19e48e2aa6423dc4?permalink_comment_id=5142437#gistcomment-5142437
### Follow-up
1. int4 WoQ GEMM micro-kernel will also be added in a separate PR.
2. A subsequent PR would add logic for deciding whether or not to dequantize weights beforehand.
E2E perf measurement should be done with #131310.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131887
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
Summary:
Add a special field in Graph- and Node-level metadata called "custom", which should be mapped to a JSON-serializable object; we guarantee this field is always preserved across the following transformations:
1. copy/deepcopy
2. run_decompositions()
3. serialization
4. re-exporting
Test Plan: :test_export -- -r custom_tag
Reviewed By: angelayi
Differential Revision: D60291839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131912
Approved by: https://github.com/angelayi
Forward fix after #132464 because TuningContext had been created during static library init, which creates the TuningResultsValidator, which tries to query HIP device properties before the HIP runtime has initialized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133347
Approved by: https://github.com/zixi-qi
Summary:
Follow up small diff to fix a couple issues:
- add condition for cuda/gpu case to only print kernel name list in the second pass i.e. when we do the cpp wrapper codegen
- other minor fixes around `AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT` option
Test Plan:
```
AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT="triton_poi_fused_0" AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_abi_compatible_cuda
```
Differential Revision: D60954888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133016
Approved by: https://github.com/ColinPeppler
This feature is not yet supported on ROCm.
Skipping:
distributed/test_symmetric_memory.py::SymmetricMemoryTest::test_low_contention_all_gather_symm_mem_input_False
With the errors:
RuntimeError: CUDASymmetricMemory requires PYTORCH_C10_DRIVER_API_SUPPORTED
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133241
Approved by: https://github.com/pruthvistony, https://github.com/malfet
When we call benchmarker.benchmark(fn, (), {}) it attempts to infer the device from the args and kwargs, which are both empty. In this case the default behavior is to assume CPU, since `is_cpu_device` is implemented as `all([x.device == "cpu" for x in ... if x is Tensor])`, and `all([]) == True`. I've added a PR that makes this raise an error, but we should just fix this one callsite first.
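A simplified sketch of the vacuous-truth pitfall (not the benchmarker's exact code):
```python
import torch

def is_cpu_device(args):
    # all([]) is True, so an empty argument list silently reports "CPU".
    return all(x.device.type == "cpu" for x in args if isinstance(x, torch.Tensor))

print(is_cpu_device(()))                    # True -- no device info at all
print(is_cpu_device((torch.randn(2, 2),)))  # True -- genuinely a CPU tensor
```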
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133290
Approved by: https://github.com/eellison
To avoid the high overhead of constructing data structures in Python when the user is simply saving the trace to a file, we only process things lazily.
## Details
1. Delay function event parsing, add a flag to denote when needed.
1. Make profiler.function_events a computed property so code using `prof.function_events` does not need to change.
1. Fix coverage for `str(prof)` in profiler tests.
## Test run
Test program
```
import torch
from torch.profiler import profile, record_function, ProfilerActivity
def payload(use_cuda=False):
    x = torch.randn(10, 10)
    if use_cuda:
        x = x.cuda()
    y = torch.randn(10, 10)
    if use_cuda:
        y = y.cuda()
    z = torch.mm(x, y)
    z = z + y
    if use_cuda:
        z = z.cpu()

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        payload()
prof.export_chrome_trace("/tmp/test_trace.json")
# print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```
The print "this is computing events" will happen lazily.
```
$ python3 profiler_test.py
Brian: this is computing function events
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
model_inference 6.77% 441.628us 100.00% 6.523ms 6.523ms 1
aten::randn 1.86% 121.108us 46.93% 3.061ms 1.530ms 2
aten::mm 45.36% 2.959ms 45.44% 2.964ms 2.964ms 1
aten::normal_ 44.72% 2.917ms 44.72% 2.917ms 1.458ms 2
aten::add 0.87% 56.646us 0.87% 56.646us 56.646us 1
aten::empty 0.35% 22.808us 0.35% 22.808us 11.404us 2
aten::resolve_conj 0.08% 5.173us 0.08% 5.173us 1.724us 3
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 6.523ms
$ python3 profiler_test.py
$ ls -l /tmp/test_trace.json
-rw-r--r-- 1 bcoutinho users 16471 Aug 5 16:10 /tmp/test_trace.json
```
## Unit test
Updates some tests and they all pass now.
`pytest test/profiler/test_profiler.py`
Also
`python test/test_autograd.py TestAutogradWithCompiledAutograd.test_record_function`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132713
Approved by: https://github.com/sraikund16
Summary: This test is flaky internally, but it's not a great test in the first place since it relies on the max-autotune step to bump a related counter. Instead of doing that, directly install a mock that bumps a counter specifically for this test. Additionally, test that the caching logic correctly accommodates an arbitrary counter delta (previously the relevant counter was only bumped by +1).
Differential Revision: [D61141164](https://our.internmc.facebook.com/intern/diff/D61141164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133244
Approved by: https://github.com/eellison
Reland by reverting commit 844103197d3e8cf6b4b59176e473365113f4f962. #131675 failed a few internal tests because it imported a diff version which wasn't rebased on the proper dependent diffs. Reland from OSS only to avoid the out-of-sync issue.
Original description from #131675
Summary:
A ComboKernel combines independent Inductor Triton kernels into a single one.
This is the part 2 pull request, which 1) adds automatic horizontal fusion at the end of the Inductor operator fusion process, and 2) adds type annotations for triton_combo_kernel.py.
ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel; the front-end kernel generation logic remains the same. 2) An extra optimization phase is added to the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py.
This part 2 pull request deals with the 2nd case above:
The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end inside the scheduler, it topologically sorts the schedule nodes to find all the nodes with no data dependency and creates a front-end schedule node for them. We currently limit the maximum number of sub-nodes for each combo kernel to 8 (but we still need to find the optimal number). 2) Then, these sub-nodes are combined in the codegen phase to generate the combo kernel code for them based on a few rules. For example, 1d and 2d kernels are separated into different combo kernels, as mixing them is not supported yet. Note that the algorithms we provide are very basic, and users can register their customized combo kernel generation algorithms for both steps.
Performance-wise, combining small kernels almost always yields a performance gain. However, combining very large kernels may not see any perf gain, and can sometimes even regress, possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such perf regressions, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True.
Please refer to part 1 pull request https://github.com/pytorch/pytorch/pull/124969 for more details.
Test Plan: buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133291
Approved by: https://github.com/wdvr
Summary: Should skip C++ warmup `unwind::unwind();` if there is no context set. This call is sometimes causing hanging issues since C++ stack collection is not robust.
Test Plan: CI
Differential Revision: D60965985
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133038
Approved by: https://github.com/eqy
Optimize the `compile_only` logic. The original code only applied to `CppTorchCudaOptions`; this PR makes it apply to all build option classes.
Changes:
1. Remove `libraries_dirs` and `libraries` settings when `compile_only` is set.
2. Remove compile_only from CppTorchCudaOptions.
3. Make `compile_only` apply to all classes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129975
Approved by: https://github.com/henrylhtsang
The `LRScheduler` class provides methods to adjust the learning rate during optimization (as updated in this PR). Also, as a note, all the lr_scheduler classes are already listed in the `How to adjust learning rate` section.
Fixes #127884
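A generic usage sketch of adjusting the learning rate with an LR scheduler (not taken from the docs change itself):
```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=2, gamma=0.5)

for epoch in range(4):
    # ... one epoch of training would go here ...
    opt.step()
    sched.step()
    print(epoch, sched.get_last_lr())  # lr halves every 2 epochs
```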
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133243
Approved by: https://github.com/janeyx99
This PR refactors process_inputs so that it occurs earlier outside of create_aot_dispatcher_function for the purpose of calculating a cache key with the inputs after they have been processed.
This way, if tensors have symint sizes/strides, we successfully factor that into the cache key instead of specializing on every possible size and stride. Test that utilizes this incoming.
# Guard behavior
Note that it's technically possible for tensors with symint arguments to introduce guards in aot_dispatch, if they trace through decompositions that branch on tensor size/stride. This can result in multiple graph modules with differing guards having the same key in the cache.
FXGraphCache has this same issue, and the remote FXGraphCache intentionally does not handle this: instead it only saves the first result in the cache, and cache misses if guards miss. The local FXGraphCache does handle this by storing multiple files and iterating through them, but we opt not to introduce that complexity just yet for AOTAutogradCache until we deem it necessary (i.e., models appear where saving multiple cache results with different guards but the same cache key becomes important). Instead, AOTAutogradCache will save a single entry per result, overriding it if it cache misses due to guards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130962
Approved by: https://github.com/bdhirsh
In joint-graph export we have a `copy.deepcopy(ep.graph_module)` call. This turns out to be an imperfect deepcopy, because deepcopy allows objects to overwrite their `__deepcopy__` methods. For fx.Graph, this ends up deferring to `Graph.create_node()`, which checks the graph namespace and can avoid copying the exact name in niche examples, like where the name is a Python keyword (e.g. `input` gets renamed to `input_1`).
Names like `input` happen because export's placeholder naming pass overwrites what the namespace creates, based on the model's `forward()` signature. So we can either 1) avoid overwriting such cases, which requires rewriting the naming pass logic, or 2) force another overwrite after deepcopying. This goes with 2).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133269
Approved by: https://github.com/zhxchen17, https://github.com/dvorjackz, https://github.com/ydwu4
The process of inlining HOO subgraphs (e.g. set_grad_enabled) seems to break node.users when a node is present in multiple subgraphs, for example:
```
class SetGradCase(torch.nn.Module):
    def forward(self, x):
        _x = x.shape[0] + 2
        _xx = _x + 2
        with torch.no_grad():
            y = _x * 4
        return _xx, y
```
The `_x` node should have 2 users (`_xx` and `y`) after inlining, but on inspection it only has `y` as a user.
Previously we were completely clearing node.users for output nodes in HOO subgraphs before inlining them - we should just be deleting the subgraph output nodes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133144
Approved by: https://github.com/larryliu0820, https://github.com/ydwu4
Fixes #133163
Debugged in collaboration with @hariveliki
The `bytes` type requires the global `_codecs.encode`. That means the following currently works:
```python
import _codecs
import torch
torch.save(b'hello', '/tmp/dummy.pth')
torch.serialization.add_safe_globals([_codecs.encode])
torch.load('/tmp/dummy.pth', weights_only=True)
```
Similarly, `bytearray` needs `builtins.bytearray`.
Following the `torch.load` docs promise, both types should be supported without `add_safe_globals` as they are both primitive types:
> weights_only: Indicates whether unpickler should be restricted to
> loading only tensors, primitive types, dictionaries
> and any types added via :func:`torch.serialization.add_safe_globals`.
This PR adds both `_codecs.encode` and `builtins.bytearray` to `_get_allowed_globals` and test for saving and loading of both types with and without `weights_only`.
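After this change, a round trip like the following should work without any `add_safe_globals` call (sketch; assumes a build containing this fix):
```python
import torch

torch.save(bytearray(b"hello"), "/tmp/dummy_bytes.pth")
print(torch.load("/tmp/dummy_bytes.pth", weights_only=True))  # bytearray(b'hello')
```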
Co-authored-by: hariveliki <98284163+hariveliki@users.noreply.github.com>
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133189
Approved by: https://github.com/mikaylagawarecki
Makes it possible to run `test/profiler/test_profiler.py#test_profiler_pattern_matcher_json_report` on CI environments where the test runner doesn't have write permissions to the current-working-directory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133009
Approved by: https://github.com/zou3519
We promise the user that these custom ops (and their kernels) are black
boxes w.r.t. torch.compile. Unfortunately Dynamo can turn itself back
on in the implementation of the custom operator, so we force it off by
disabling Dynamo.
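A minimal sketch of the user-facing contract (the `mylib::add_one` op and its registration below are illustrative, not part of this PR):
```python
import torch

# Hypothetical custom op: its body is a black box w.r.t. torch.compile, so Dynamo
# stays disabled inside the kernel even when called from compiled code.
@torch.library.custom_op("mylib::add_one", mutates_args=())
def add_one(x: torch.Tensor) -> torch.Tensor:
    return x + 1

@add_one.register_fake
def _(x):
    return torch.empty_like(x)

@torch.compile(fullgraph=True)
def f(x):
    return add_one(x)

print(f(torch.randn(3)))
```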
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133125
Approved by: https://github.com/ezyang
For workloads that only exercised scaled_mm, the csv result file would not contain the same set of validators as a gemm workload. Trying to reuse the same csv file between workloads would discard the file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132464
Approved by: https://github.com/zixi-qi
Summary:
The Quantizer subclass can return a new model from `transform_for_annotation`,
and this is common if it uses any ExportPass subclass which does not mutate in-place.
Use the returned model instead of assuming it's the same.
Differential Revision: D60869676
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132893
Approved by: https://github.com/jerryzh168
Summary:
Add back the change in 19897a1647.
The change was lost in refactoring due to a bad rebase.
Test Plan:
CI
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//torchrec/distributed/tests:test_pt2 -- --filter-text test_sharded_quant_fpebc_non_strict_export
```
Differential Revision: D61052687
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133142
Approved by: https://github.com/ydwu4
This PR only adds the execution of the benchmarks on this PR and print results, following diffs will add checking out head~1 and running it and comparing.
to access results goto test pr_time_benchmarks and inspect logs:
you should see
```
+ echo 'benchmark results on current PR: '
benchmark results on current PR:
+ cat /var/lib/jenkins/workspace/test/test-reports/pr_time_benchmarks_before.txt
update_hint_regression,instruction_count,27971461254
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131475
Approved by: https://github.com/ezyang
- Reduced number of skipped test cases
- Merged redundant test cases
**Benchmark:**
| | Original | New |
| ----- | ----- | ----- |
| Run time | 60 mins | 35 mins |
| Total tests | 75k | 18k |
| Skipped tests | 20k | 4k |
_These are approximate numbers from running test_transformers.py on a single H100, and can change based on the device._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133049
Approved by: https://github.com/drisspg
I worked with @henrylhtsang to switch the cpp_builder to the new one, and we have removed the dependency on the old implementation.
So it is time to remove the old implementation now; this PR makes that change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133161
Approved by: https://github.com/ezyang
Partially addresses https://github.com/pytorch/pytorch/issues/130170 for float scalars saved from forward pass of a custom c++ autograd function. With this PR, compiled autograd no longer recaptures when the float value changes, but downstream support isn't there yet: 4bdb4bbd86/torch/_dynamo/config.py (L58-L61)
Currently, any non-tensors passed in ctx->saved_data are specialized on by compiled autograd. To stop specializing on float values, we lift the float. We also require user code to use IValue::toSymFloat instead of IValue::toDouble in order to swap the SymFloat to a proxy during compiled autograd tracing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133048
Approved by: https://github.com/jansel
ghstack dependencies: #132771
Addresses https://github.com/pytorch/pytorch/issues/130170 for int scalars saved from forward pass of a custom c++ autograd function
Currently, any non-tensors passed in ctx->saved_data are specialized on by compiled autograd. To stop specializing on int values, we lift the ints. We also require user code to use IValue::toSymInt instead of IValue::toInt in order to swap the SymInt to a proxy during compiled autograd tracing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132771
Approved by: https://github.com/jansel
Summary: Pessimistically assume that things are being torn down if TCPStore is not available and do not attempt to dump stack traces.
Test Plan:
Seeing crashes in production when Flight Recorder is enabled.
Here's the relevant mast link: https://fburl.com/mlhub/qia257xh
Reviewed By: fduwjj
Differential Revision: D61055124
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133150
Approved by: https://github.com/fduwjj
## Summary
As part of #125683, this PR modifies existing CPU GEMM cpp template & micro-kernel template to enable int8 WoQ GEMM auto-tuning with AVX2, AVX512 & AMX ISAs (the latter is only available on Xeon 4th generation & beyond).
WoQ GEMM takes FP16/BF16 activations, int8 weights, and scale of the same dtype as activations.
The operation is equivalent to `torch.nn.functional.linear(x, w.to(x.dtype)) * scale`, which is essentially what the ATen op `torch.ops.aten._weight_int8pack_mm` currently does (except that weights are not cached by it). Weights will be considered constant & cached, so this implementation is suitable for inference, and not QAT. `scale` is supported as a `mul` epilogue.
Only BF16 activations have been supported in this PR because for FP16 & FP32, weight is dequantized during constant-folding pass of freezing, and then after auto-tuning, performance with a large `M` dimension may be better than either torch.ops.aten._weight_int8pack_mm, or the WoQ micro-kernel support introduced in this PR, which dequantizes `w` within the micro-kernel.
While even BF16 activations with a large `M` dimension may benefit from dequantizing `w` beforehand, for now, they would use WoQ support in GEMM templates for auto-tuning, and then a subsequent PR would add logic for deciding whether or not to dequantize weights beforehand.
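For reference, a sketch of the plain-PyTorch semantics described above (shapes and the per-channel scale layout are illustrative):
```python
import torch

x = torch.randn(4, 64, dtype=torch.bfloat16)                   # BF16 activations
w_int8 = torch.randint(-128, 127, (32, 64), dtype=torch.int8)  # int8 weights [N, K]
scale = torch.rand(32, dtype=torch.bfloat16)                   # per-output-channel scale

# The templated WoQ kernel computes the equivalent of dequantize-then-linear:
out = torch.nn.functional.linear(x, w_int8.to(x.dtype)) * scale
print(out.shape)  # torch.Size([4, 32])
```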
### Performance
#### AMX
Op-level speedup due to AMX micro-kernel (selected during auto-tuning) on 32 physical cores of Intel(R) Xeon(R) Platinum 8468H (of Xeon 4th generation series, codenamed Sapphire Rapids) vs. ATen kernel `torch.ops.aten._weight_int8pack_mm`. Intel OpenMP & tcmalloc were preloaded.
In a few cases with an odd `K`, the implementation being added in this PR may not perform as well as the ATen kernel, which is unrelated to this PR, though, since `test_linear_amx` also exhibits similar datapoints. In those cases, the AMX micro-kernel might be slower than AVX512 micro-kernel, so if such sets of shapes are used for auto-tuning, either the AVX512 micro-kernel implementation, or the ATen kernel would be chosen instead.
Benchmarked with unit-tests.
Tabular data at https://gist.github.com/sanchitintel/294811a86c8ff6b867c668ae2107c405?permalink_comment_id=5142442#gistcomment-5142442
The AVX512 micro-kernel was disabled to collect data for AMX micro-kernel.
#### AVX2/AVX512 micro-kernels
Tabular data at https://gist.github.com/sanchitintel/52b5fa9c66f791be19e48e2aa6423dc4?permalink_comment_id=5142437#gistcomment-5142437
### Follow-up
1. int4 WoQ GEMM micro-kernel will also be added in a separate PR.
2. A subsequent PR would add logic for deciding whether or not to dequantize weights beforehand.
E2E perf measurement should be done with #131310.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131887
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
Summary:
# context
* use FakeProcessGroup to mimic the multi-process tests
* can use `_test_compile_fake_pg_fn` as the single-process VB compile test
```
from torchrec.distributed.tests.test_pt2_multiprocess import _test_compile_fake_pg_fn
_test_compile_fake_pg_fn(
    rank=0,
    world_size=2,
)
```
reference: D59637444
Test Plan:
# run test
* run command and results: P1519228952, [tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpwMCK1E/index.html)
```
TORCH_TRACE=/var/tmp/tt TORCH_SHOW_CPP_STACKTRACES=1 TORCH_LOGS="+all" buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:test_pt2_multiprocess
```
Differential Revision: D56124045
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133039
Approved by: https://github.com/ezyang
This PR adds back 10 configs for tuned_mm that were previously removed in https://github.com/pytorch/pytorch/pull/126570. The main idea is that we use 30 configs to autotune only when data is collected with AutoHeuristic. The learned heuristic will prune these 30 configs down to 10 configs, which reduces compilation time and at the same time might improve performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131616
Approved by: https://github.com/eellison
ghstack dependencies: #131615
Adds an optional capability to detect missing ranks (which can be mapped to host info via the `rank_tracing_decoder` lambda argument) in the store barrier operation.
This approach has been used in some form already, moving it to collectives API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132818
Approved by: https://github.com/d4l3k
### tl;dr
This PR adds GQA support to higher order op `flex_attention`.
## Details
When `enable_gqa` is set to True, HOP `flex_attention(score_mod, query, key, value, block_mask, enable_gqa)` runs Group Query Attention(GQA), where the number of query heads (Hq) is a multiple of number of key/value heads (Hkv). Each group of query heads (`Hq//Hkv` heads) attends to a shared kv head.
Otherwise, `flex_attention` assumes Multi Head Attention (MHA) where the number of query heads is equal the number of kv heads.
The `score_mod` and `mask_mod` API are adapted accordingly to take `q_head` as head index.
```
def score_mod(score: torch.Tensor, batch: torch.Tensor, q_head: torch.Tensor, token_q: torch.Tensor, token_kv: torch.Tensor) -> torch.Tensor
def mask_mod(batch: torch.Tensor, q_head: torch.Tensor, token_q: torch.Tensor, token_kv: torch.Tensor) -> torch.Tensor
```
## Example
```python
import torch
from torch.nn.attention.flex_attention import flex_attention
from torch.nn.attention.flex_attention import create_block_mask
torch.manual_seed(0)
def query_key_value_clones(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    dtype: torch.dtype = None,
):
    """Clones the query, key, and value tensors and moves them to the specified dtype."""
    if dtype is None:
        dtype = query.dtype
    query_ref = query.clone().detach().to(dtype).requires_grad_(query.requires_grad)
    key_ref = key.clone().detach().to(dtype).requires_grad_(key.requires_grad)
    value_ref = value.clone().detach().to(dtype).requires_grad_(value.requires_grad)
    return query_ref, key_ref, value_ref
# Lets create some input tensors
# The input tensor has shape (batch_size, num_heads, seq_len, head_dim).
# query and key/value can have different num_heads and seq_len
# Here 8 query heads share one KV head.
query = torch.randn(2, 8, 2048, 64, device="cuda", dtype=torch.float32, requires_grad=True)
key = torch.randn(2, 2, 2048, 64, device="cuda", dtype=torch.float32, requires_grad=True)
value = torch.randn(2, 2, 2048, 64, device="cuda", dtype=torch.float32, requires_grad=True)
query1, key1, value1 = query_key_value_clones(query, key, value)
# Lets create a score_modification. We take alibi_bias as an example.
# score_mod takes batch index, query head index, query index, and key/value index.
def _generate_alibi_bias(num_kv_heads: int, num_q_heads: int):
    def _alibi_bias(
        score: torch.Tensor,
        b: torch.Tensor,
        hq: torch.Tensor,
        token_q: torch.Tensor,
        token_kv: torch.Tensor,
    ) -> torch.Tensor:
        # Let's calculate the kv head from the query head index
        group = num_q_heads // num_kv_heads
        hkv = hq // group
        scale = torch.exp2(-((hkv + 1) * 8.0 / num_kv_heads))
        return score + (token_kv - token_q) * scale
    return _alibi_bias

# Let's apply a causal mask on top of it
def causal_mask(b, h, q, kv):
    return q >= kv
# Generate a block mask for our new mask_mod function.
# The mask is broadcasted along head & batch dimensions.
block_mask = create_block_mask(causal_mask, B=1, H=1, Q_LEN=2048, KV_LEN=2048)
# Lets call flex_attention with our new score modification and block mask under eager mode.
output = flex_attention(query, key, value, score_mod=_generate_alibi_bias(2, 8), block_mask=block_mask, enable_gqa=True)
# Now lets compile flex_attention and run the flex_attention kernel.
compiled_flex_attention = torch.compile(flex_attention)
out_compiled = compiled_flex_attention(query1, key1, value1, score_mod=_generate_alibi_bias(2, 8), block_mask=block_mask, enable_gqa=True)
torch.testing.assert_close(output, out_compiled, atol=5e-2, rtol=2e-2)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131559
Approved by: https://github.com/drisspg
Summary: D56956245 added the ability to accumulate FunctionEvents across multiple cycles in order to perform statistical analysis on them all together. Although this can be useful, it uses too many CPU resources especially for long running jobs. For this reason, lets add a flag to the profiler to turn off this behavior by default, but still allow users to turn it on if they wish.
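A hedged usage sketch, assuming the flag is exposed as an `acc_events` argument on the profiler (argument name taken from the test plan below):
```python
import torch
from torch.profiler import ProfilerActivity, profile

# Opt back in to accumulating FunctionEvents across profiling cycles; with this
# change the accumulation is off by default.
with profile(activities=[ProfilerActivity.CPU], acc_events=True) as prof:
    torch.randn(128, 128) @ torch.randn(128, 128)
print(len(prof.events()))
```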
Test Plan: Changed function count test to have acc_events passed in and check the amount of function events based on if flag is true or not
Differential Revision: D61021490
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133095
Approved by: https://github.com/briancoutinho, https://github.com/LucasLLC, https://github.com/aaronenyeshi
Summary:
Fixes T198245910.
In the previous diff D60532628 that caused the test failure, we fixed the inconsistency caused by constant tensors being accidentally registered as buffers, by deleting the buffer and re-assigning them as constants.
However, this broke several existing tests in pyspeech: when the exported program is re-traced with torch.jit.trace (an anti-pattern we should probably align on), the jit tracer finds this constant tensor requiring grad and errors out.
This PR forces constant attributes to not require grad, which is the correct behavior. A better fix would be finding out where the constants are created in user code and why they require grad, but this has low ROI, so we warn the user about it.
Test Plan: See failures in T198245910.
Differential Revision: D60974869
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133031
Approved by: https://github.com/angelayi
motivated by FSDP2 + DoRA https://github.com/pytorch/pytorch/issues/132721
after meta init, we need a user-defined function to move DoRALinear.magnitude from device=meta to device=cuda
The problem is how to trigger reset_sharded_param or _apply to update FSDPParam. Otherwise lazy_init complains that DoRALinear.magnitude is still on device=meta.
credit to @awgu for chasing after DDP lazy_init to unblock the PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132954
Approved by: https://github.com/awgu
ghstack dependencies: #133059
**Description**
**_[BUG FIX]_**
This PR fixes a bug which happens during compilation with GCC 11.4 compiler in the FlashAttentionKernel.cpp file. This issue doesn't seem to be with PyTorch main branch but gets introduced with our SVE PR changes (https://github.com/pytorch/pytorch/pull/119571 ) + PyTorch main.
See the CI Pipeline failing in our PR:
https://github.com/pytorch/pytorch/actions/runs/9895714768/job/27336251795?pr=119571
```
/var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/FlashAttentionKernel.cpp.SVE256.cpp
during RTL pass: expand
In file included from /var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/FlashAttentionKernel.cpp.SVE256.cpp:1:
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/FlashAttentionKernel.cpp: In lambda function:
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/FlashAttentionKernel.cpp:290:57: internal compiler error: in emit_move_insn, at expr.c:3821
290 | at::parallel_for(0, batchSize * num_head * qSlice, 1, [&](int64_t begin, int64_t end) {
| ^
0xffffb03f73fb __libc_start_call_main
../sysdeps/nptl/libc_start_call_main.h:58
0xffffb03f74cb __libc_start_main_impl
../csu/libc-start.c:392
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <file:///usr/share/doc/gcc-11/README.Bugs> for instructions.
[5731/6839] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/CatKernel.cpp.SVE256.cpp.o
[5732/6839] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/GridSamplerKernel.cpp.SVE256.cpp.o
```
This issue with compilation only happens with GCC 11.4 and works well with the latest GCC 12.3 compiler and also the Clang compiler. The issue is related to the check for `is_b_stride_zero` introduced as a template parameter (compile time check complexity) in the following commit: 5da428d9eb which was added recently into FlashAttentionKernel.cpp file.
This PR fixes the above compilation failure with GCC 11.4 compiler.
cc : @Valentine233 @yanbing-j @mingfeima @malfet @jgong5 @r-barnes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132434
Approved by: https://github.com/jgong5
When checking the c10d logs, I found that they showed "[PG 7 rank 7]" when they actually meant "[PG 1 rank 7]". So we need to use the pg_id (aka uid_) rather than pg_name_, because when creating sub-PGs we currently call the creation logic multiple times, which makes PG names based on bumped-up numbers (e.g., 7 rather than 1). Using pg_id is more accurate and consistent with other logging tools.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132058
Approved by: https://github.com/shengbao-zheng, https://github.com/shuqiangzhang
Summary:
TunableOp logging improvements:
1. PYTORCH_TUNABLEOP_VERBOSE=1: print out the expected value vs actual value for TunableOp validators, so that if validation fails, we know exactly how to fix it
2. PYTORCH_TUNABLEOP_VERBOSE=3: print out the exact kernel signature for both successful and failure cases in kernel lookup
Test Plan:
> PYTORCH_TUNABLEOP_VERBOSE=3 buck
2 run mode/{opt,amd-gpu} -c fbcode.enable_gpu_sections=true //scripts/xdwang/example:fc_llama -- --enab
le-tuning
```
reading tuning results from hipblas_tuning_pt_llama0.csv
Validator PT_VERSION=2.5.0
Validator ROCBLAS_VERSION=4.0.0-72e57364-dirty
Validator HIPBLASLT_VERSION=800-a15e4178
Validator ROCM_VERSION=6.0.0.0-12969-1544e39
Validator GCN_ARCH_NAME=gfx942:sramecc+:xnack-
GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack-
ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39
HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178
ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty
PT_VERSION validation: expect 2.5.0 to match 2.5.0
Loading results
GemmTunableOp_BFloat16_TN(tn_8192_2_1024) -> Gemm_Hipblaslt_TN_61169,0.0171694
GemmTunableOp_BFloat16_TN(tn_7168_2_8192) -> Gemm_Hipblaslt_TN_61089,0.036138
GemmTunableOp_BFloat16_TN(tn_8192_2_3584) -> Gemm_Hipblaslt_TN_61169,0.0240673
missing params_signature, returning null ResultEntry for GemmTunableOp_BFloat16_TN,tn_1280_2_8192
finding fastest for GemmTunableOp_BFloat16_TN(tn_1280_2_8192) out of 2818 candidates
Rotating buffer 4 MiB. Needed Size: 20 MiB. Needed number of param copies: 1
├──tuning using warmup iters 0 [0 ms] and tuning iters 1 [0.208254 ms] instance id=0, GemmTunableOp_BFloat16_TN(tn_1280_2_8192) Default
├──offset at 3
......
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
Avg time: 16.42832040786743 us, Achieved 7.15 TFLOPS, 3578.07 GB/s
2x1280x8192-torch.bfloat16,16.260499954223633,2.5794434438103107,1294.0669757533708
2x8192x1024-torch.bfloat16,16.15394949913025,2.0771658350056508,1041.11852032876
2x7168x8192-torch.bfloat16,25.691540241241455,9.14234887416194,4574.841325057144
2x8192x3584-torch.bfloat16,16.42832040786743,7.1486621324818085,3578.0709494714856
```
Differential Revision: D60468273
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132173
Approved by: https://github.com/mxz297, https://github.com/jeffdaily, https://github.com/eqy
What we found recently is that:
1. Monitoring detects a watchdog hang (no heartbeat) at the same time as an NCCL timeout. This race leads to less useful debug info getting dumped to logs (such as CudaEventDestroy and the GIL checker).
2. We don't kill the program if the monitoring thread has not been enabled but we somehow still silently run the monitoring thread. Plus, users who feel the timeout is too short should configure TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC themselves.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133028
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
Addresses a common misconception about safety of using multiple NCCL
process groups from PyTorch.
Notably, it IS safe to use multiple process groups, so long as
communication operations from different groups are not allowed to
overlap. (Overlap of communication operations from one group with
compute operations IS ok).
TODO: after getting feedback on the text, update other copies of the warning on other APIs
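A sketch of the safe pattern the warning describes (assumes an initialized default process group; `torch.cuda.synchronize()` stands in for any mechanism that keeps the two groups' collectives from overlapping):
```python
import torch
import torch.distributed as dist

# Two NCCL process groups over the same ranks.
pg_a = dist.new_group(ranks=list(range(dist.get_world_size())))
pg_b = dist.new_group(ranks=list(range(dist.get_world_size())))

t = torch.ones(4, device="cuda")
dist.all_reduce(t, group=pg_a)
torch.cuda.synchronize()        # ensure pg_a's collective has finished...
dist.all_reduce(t, group=pg_b)  # ...before issuing work on pg_b
```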
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131895
Approved by: https://github.com/fduwjj
Currently, if storage_offset is an unbacked symbol and is_align cannot be computed at compile time, it hard-fails.
Doing the best we can: add guard_size_oblivious and fall back to False if it cannot be evaluated at compile time.
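A hedged sketch of the pattern (the helper name and alignment constant are illustrative; the real check lives inside Inductor and may catch a narrower exception):
```python
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious

ALIGNMENT = 16  # illustrative

def is_aligned(storage_offset) -> bool:
    try:
        return guard_size_oblivious(storage_offset % ALIGNMENT == 0)
    except Exception:
        # Unbacked symbol that cannot be evaluated at compile time: assume unaligned.
        return False
```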
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132423
Approved by: https://github.com/ezyang
Summary: Otherwise it will break FSDP code paths
Test Plan:
unit test
see next diff for print message
```
sh ./scripts/lufang/amd/small_repro.sh
ROCM_GET_SCALAR_ITEM_SYNC=1 sh ./scripts/lufang/amd/small_repro.sh
```
It will log "====== Async mode ======" or "====== Sync mode ======" correspondingly
Differential Revision: D60995134
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133054
Approved by: https://github.com/houseroad
Summary:
A ComboKernel combines independent Inductor Triton kernels into a single one.
This is the part 2 pull request, which 1) adds automatic horizontal fusion at the end of the inductor operator fusion process, and 2) adds type annotations for triton_combo_kernel.py.
ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel; the front-end kernel generation logic remains the same. 2) An extra optimization phase is added to the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py.
This part 2 pull request deals with the 2nd case above:
- The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end, inside the scheduler, we topologically sort the schedule nodes to find all the nodes with no data dependency and create a front-end schedule node for them. We currently limit the maximal number of sub-nodes for each combo kernel to 8 (but we still need to find the optimal number). 2) These sub-nodes are then combined in the codegen phase to generate the combo kernel code for them based on a few rules. For example, 1d and 2d kernels are separated into different combo kernels, as mixing them is not supported yet. Note that the algorithms we provide are very basic, and users can register their own customized combo kernel generation algorithms for both steps.
- Performance-wise, combining small kernels almost always yields a performance gain. However, combining very large kernels may not see any perf gain, and sometimes even a regression, possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such perf regressions, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True.
Please refer to part 1 pull request https://github.com/pytorch/pytorch/pull/124969 for more details.
Test Plan: buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels
Differential Revision: D60067757
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131675
Approved by: https://github.com/mlazos
Summary: When PyTree detects a structural mismatch between inputs and dynamic shapes, the error messages are quite horrible. This PR fixes these error messages by adding, for each kind of error, the path to the point where the error happens and an actionable reason for the error.
Test Plan: added test with several cases
Differential Revision: D60956976
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132982
Approved by: https://github.com/yushangdi
#### Description
Transform quantized operation properly. Add de/quantization before and after the quantized operation.
#### Test Plan
`pytest test/export/test_converter.py -s -k test_ts2ep_convert_quantized_model`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133026
Approved by: https://github.com/angelayi
# Summary
Changes the stance of SDPA on what to do for fully masked out rows
## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963
These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here:
https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617
Can be paraphrased as follows:
When passing in fully masked out rows, attention becomes ambiguous. We have two main options:
1. Uniformly attend to all values:
```python
scores[masked_out_rows] = 1 / len(row)
out[masked_out_rows] = 1 / len(row) * value
```
2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
```python
output[fully_masked_rows] = NaN
```
We went with option 2. Partially because it was easier to implement, but also people argued that users can slice the output to remove the NaNs:
``` Python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happens when you call backward...
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
[ nan, nan, nan, nan]])
```
Those pesky NaNs are back!
## Why do we see NaNs today?
The core of the problem revolves around using softmax function in sdpa:
```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```
## Quick Aside: Masking in Attention
Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs.
We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.
## Alternative Approaches
If we use a very large negative number instead of -inf:
```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
However if users always remembered to "slice" out their outputs i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564, 0.1613, -0.0486],
[ 0.0000, 0.0000, 0.0000, 0.0000]])
```
This would bring us back into a better state.
## A Third Option
We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.
This PR implements the new semantic for masking w/ attention in fully masked-out rows:
```python
out[masked_out_rows] = 0
```
**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.
## Details
This PR stack does 3 things:
1. Adds a PRIVATE _safe_softmax op
2. Updates semantic for flash_cpu fused kernel
3. Updates semantic for efficient_cuda fused kernel
_safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Due to this fact instead of decomposing softmax and checking for -inf rows we instead "cheat" and use nan_to_num.
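A hedged sketch of that approach (not the actual `_safe_softmax` implementation):
```python
import torch

def safe_softmax_sketch(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Fully masked (-inf) rows produce NaN under softmax; map those NaNs to 0.
    return torch.softmax(scores, dim=dim).nan_to_num(0.0)

row = torch.full((4,), float("-inf"))
print(safe_softmax_sketch(row))  # tensor([0., 0., 0., 0.]) instead of NaNs
```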
Why I think this is okay? (please find a counter point if avail)
There are multiple ways NaNs can emerge. For the fully masked out rows case nan_to_num works. But what if there were other NaNs, wouldn't this silently remove them?
The only case that this can happen is if the input itself had a NaN or an Inf
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
Will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`
Where
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`
If we dont want to even allow for the possibility of "inf" or "NaN" attention scores to be converted to 0 then we can implemented it something like this
```Python
max = torch.max(a, dim=-1, keepdim=True)
exp = torch.exp(a - max.values)
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
softmax = torch.where(max.values == float('-inf'), 0.0, softmax)
```
however we would be paying for this in math performance.
## Why Now
I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131060
Approved by: https://github.com/jbschlosser
Fixes #125077
**Feature**
This PR creates a new Inductor config, `config.triton.prefer_nd_tiling`, which is disabled by default. When enabled, this encourages the Triton code to use as many tiling dimensions as possible. This simplifies indexing expressions for discontiguous tensors, resulting in expressions like `5 * x + 8 * y` as opposed to `5 * (x // 7) + 8 * (y % 9)`. This allows us to find more block pointers than we normally would. We should now see simplified indexing expressions as long as:
1. All discontiguous reads/writes have the same shape.
2. The number of discontiguous dimensions is less than `config.triton.max_tiles`.
Here's an example kernel (elementwise add of views) with ND tiling disabled:
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 21
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 7
    x1 = (xindex // 7)
    x2 = xindex
    tmp0 = tl.load(in_ptr0 + (x0 + (9*x1)), xmask)
    tmp1 = tl.load(in_ptr1 + (x0 + (9*x1)), xmask)
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[21], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```
And here's the version with it enabled:
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    ynumel = 3
    xnumel = 7
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    x1 = xindex
    y0 = yindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[7, 3], strides=[1, 9], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1], eviction_policy='evict_last')
    tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[7, 3], strides=[1, 9], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1], eviction_policy='evict_last')
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[7, 3], strides=[1, 7], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), tl.broadcast_to(tmp2, [XBLOCK, YBLOCK]).to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
```
With this feature enabled, we get a discontiguous strided block pointer. Previously, this would only have worked for specific shapes, like powers of 2 or multiples of the maximum block size. With this PR, we can support arbitrary shapes so long as we have enough tiles to cover all discontiguous dimensions.
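To try this out, the config can be toggled like this (a sketch; shapes chosen to mirror the 3x7 views in the kernels above):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.triton.prefer_nd_tiling = True  # disabled by default

@torch.compile
def add_views(a, b):
    # Discontiguous (3, 7) views of (3, 9) tensors, as in the kernels above.
    return a[:, :7] + b[:, :7]

out = add_views(torch.randn(3, 9, device="cuda"), torch.randn(3, 9, device="cuda"))
```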
**Test plan**
This PR adds some tests for pointwise ops with discontiguous tensors.
- Test that we can generate block pointers for views with odd shapes like `(5,7)`, `(9,3,5)`, etc.
- Test that we can generate block pointers for a single discontiguous dim in 3D and 4D tensors.
- Test that we generate a 2D tiling for a 5D tensor with two discontiguous dims. This case doesn't generate a block pointer, but it checks that the output code is at least correct.
This PR also parametrizes some existing tests to run with and without `triton.prefer_nd_tiling`. That way, we ensure this feature doesn't break existing usage.
Since this setting isn't enabled on most tests, I also created https://github.com/pytorch/pytorch/pull/132935 to test what happens when `triton.prefer_nd_tiling=True` by default. None of the failures seem related to invalid tiling, so I think this feature is safe to merge.
**Limitations and follow-ups**
I can see two main improvements which would expand the usefulness of this feature:
1. This feature currently only works for pointwise kernels, since reductions are never tiled. As a follow-up, we could enable tiled reductions to extend these benefits to reduction kernels.
2. The usefulness of this feature depends on `triton.config.max_tiles`. This is currently restricted to 2 by default, although it can be increased to 3 in certain cases. To support more discontiguous dims, we might consider expanding support for 3D tiling, or even supporting ND tiling, by mapping an ND "virtual" launch grid onto Triton's 3D launch grid.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132937
Approved by: https://github.com/jansel, https://github.com/eellison
Summary:
TunableOp logging improvements:
1. PYTORCH_TUNABLEOP_VERBOSE=1: print out the expected value vs actual value for TunableOp validators, so that if validation fails, we know exactly how to fix it
2. PYTORCH_TUNABLEOP_VERBOSE=3: print out the exact kernel signature for both successful and failure cases in kernel lookup
Test Plan:
> PYTORCH_TUNABLEOP_VERBOSE=3 buck
2 run mode/{opt,amd-gpu} -c fbcode.enable_gpu_sections=true //scripts/xdwang/example:fc_llama -- --enab
le-tuning
```
reading tuning results from hipblas_tuning_pt_llama0.csv
Validator PT_VERSION=2.5.0
Validator ROCBLAS_VERSION=4.0.0-72e57364-dirty
Validator HIPBLASLT_VERSION=800-a15e4178
Validator ROCM_VERSION=6.0.0.0-12969-1544e39
Validator GCN_ARCH_NAME=gfx942:sramecc+:xnack-
GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack-
ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39
HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178
ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty
PT_VERSION validation: expect 2.5.0 to match 2.5.0
Loading results
GemmTunableOp_BFloat16_TN(tn_8192_2_1024) -> Gemm_Hipblaslt_TN_61169,0.0171694
GemmTunableOp_BFloat16_TN(tn_7168_2_8192) -> Gemm_Hipblaslt_TN_61089,0.036138
GemmTunableOp_BFloat16_TN(tn_8192_2_3584) -> Gemm_Hipblaslt_TN_61169,0.0240673
missing params_signature, returning null ResultEntry for GemmTunableOp_BFloat16_TN,tn_1280_2_8192
finding fastest for GemmTunableOp_BFloat16_TN(tn_1280_2_8192) out of 2818 candidates
Rotating buffer 4 MiB. Needed Size: 20 MiB. Needed number of param copies: 1
├──tuning using warmup iters 0 [0 ms] and tuning iters 1 [0.208254 ms] instance id=0, GemmTunableOp_BFloat16_TN(tn_1280_2_8192) Default
├──offset at 3
......
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
Avg time: 16.42832040786743 us, Achieved 7.15 TFLOPS, 3578.07 GB/s
2x1280x8192-torch.bfloat16,16.260499954223633,2.5794434438103107,1294.0669757533708
2x8192x1024-torch.bfloat16,16.15394949913025,2.0771658350056508,1041.11852032876
2x7168x8192-torch.bfloat16,25.691540241241455,9.14234887416194,4574.841325057144
2x8192x3584-torch.bfloat16,16.42832040786743,7.1486621324818085,3578.0709494714856
```
Differential Revision: D60468273
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132173
Approved by: https://github.com/mxz297, https://github.com/jeffdaily
Summary: We found that recent CMF and IGCTR have more complicated patterns to optimize in order to remove as many stack/cat nodes as possible, so we designed these patterns.
Test Plan:
# unit test
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3659174939423652
Network: Up: 113KiB Down: 112KiB (reSessionID-11c9b598-af3a-4727-8f02-ccb1471d092b)
Jobs completed: 27. Time elapsed: 5:45.8s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0
# benchmark
### cmf
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf_shrink" --flow_id 587303213 -n
```
P1515072258
Counter({'pattern_matcher_nodes': 2170, 'pattern_matcher_count': 1766, 'normalization_pass': 402, 'remove_split_with_size_one_pass': 269, 'extern_calls': 193, 'merge_splits_pass': 74, 'normalization_aten_pass': 51, 'fxgraph_cache_miss': 9, 'batch_aten_mul': 6, 'scmerge_split_sections_removed': 5, 'scmerge_split_removed': 3, 'scmerge_cat_removed': 3, 'unbind_stack_pass': 3, 'batch_sigmoid': 2, 'batch_linear': 2, 'batch_aten_sub': 2, 'batch_layernorm': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'split_stack_to_cats_pass': 1, 'split_cat_to_slices_pass': 1, 'batch_aten_add': 1, 'batch_relu': 1})
### ig_ctr
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697 -n
```
P1515087739
Counter({'pattern_matcher_nodes': 1832, 'pattern_matcher_count': 1564, 'extern_calls': 378, 'normalization_pass': 345, 'normalization_aten_pass': 49, 'fxgraph_cache_miss': 18, 'batch_aten_mul': 6, 'scmerge_cat_removed': 5, 'scmerge_cat_added': 4, 'batch_linear_post_grad': 4, 'scmerge_split_removed': 3, 'unbind_stack_pass': 3, 'unbind_cat_to_view_pass': 3, 'batch_tanh': 2, 'scmerge_split_sections_removed': 2, 'scmerge_split_added': 2, 'split_stack_to_cats_pass': 2, 'split_cat_to_slices_pass': 1})
# e2e
testing the following new patterns
```
"split_stack_to_cats_pass": {},
"split_cat_to_slices_pass": {},
"unbind_cat_to_view_pass": {},
```
Note that you can tune the hyper-parameter "threshold_to_cat" for these patterns; the minimum value you give should be at least 2. The larger the value, the less aggressively nodes are sliced (the cat is kept instead); the default value is 10. You can tune the parameters by setting threshold_to_cat. For example:
```
"split_stack_to_cats_pass": {"threshold_to_cat": 10},
"split_cat_to_slices_pass": {"threshold_to_cat": 10},
"unbind_cat_to_view_pass": {"threshold_to_cat": 10},
```
Note that the default value may not be optimal; it's based on my experiments on CMF and IGCTR, and you are more than welcome to tune the value to find the best threshold for you. For example, in the CMF local run,
- when "threshold_to_cat" is 2
P1515072258
=============Print full analysis for cmf_shrink================
| Metric | Value |
|:-------------------|:----------------|
| Batch size | 10 |
| Latency | 156.07 ms |
| Model size | 844357184 bytes |
| Flops/example | 583.53 G |
| TFLOPS | 37.39 |
| MFU | 4.67% |
| Activation/example | 1707.49 MB |
- when "threshold_to_cat" is 10
P1515912635
=============Print full analysis for cmf_shrink================
| Metric | Value |
|:-------------------|:----------------|
| Batch size | 10 |
| Latency | 155.09 ms |
| Model size | 844357184 bytes |
| Flops/example | 583.53 G |
| TFLOPS | 37.63 |
| MFU | 4.70% |
| Activation/example | 1707.49 MB |
ads_dper3:164562cbe29f6c5aea4546cf3d463b87
training_platform:5e455c643c52940bb4567017f4c7ba83
## cmf
baseline
f588717948
proposal
f588719502
### QPS and NE results
{F1793304642}
{F1793304664}
{F1793304689}
{F1793304683}
### Compilation time reduction
zoomer link: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=1045728747213538&tab=pt2_metrics
Compile time for that frame is reduced to 1 min from 9 min.
### trace analysis
baseline trace link
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Ff588722004-TrainingApplication%2F0%2Frank-1.Aug_06_00_03_46.3617.pt.trace.json.gz&bucket=pyper_traces
proposal trace link
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Ff588723545-TrainingApplication%2F0%2Frank-1.Aug_05_23_54_56.3647.pt.trace.json.gz&bucket=pyper_traces
{F1793312804} {F1793312867}
From the trace, we can see that the green part (introduced by split cat) has been reduced significantly with our new patterns.
Differential Revision: D60750275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132831
Approved by: https://github.com/jackiexu1992
Summary:
Re-enable testHelperPrefix test that was erroneously disabled in CI.
Fixes#50701
Test Plan:
Test passes locally:
```
❯ ./TCPStoreTest --gtest_filter=TCPStoreTest.testHelperPrefix
Running main() from
/data/users/cpio/pytorch/third_party/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = TCPStoreTest.testHelperPrefix
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TCPStoreTest
[ RUN ] TCPStoreTest.testHelperPrefix
[W807 12:01:31.531576727 socket.cpp:462] [c10d] waitForInput: poll for
socket SocketImpl(fd=6, addr=[localhost]:37984,
remote=[localhost]:37171) returned 0, likely a timeout
[W807 12:01:31.531663710 socket.cpp:487] [c10d] waitForInput: socket
SocketImpl(fd=6, addr=[localhost]:37984, remote=[localhost]:37171) timed
out after 100ms
[ OK ] TCPStoreTest.testHelperPrefix (314 ms)
[----------] 1 test from TCPStoreTest (314 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (314 ms total)
[ PASSED ] 1 test.
╭─ ~/local/pytorch/build/bin main *1 +1 ···················· ✔
/home/cpio/local/a/pytorch-env cpio@devgpu011 ─╮
╰─
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132916
Approved by: https://github.com/Skylion007
This PR fixes flaky internal tests:
- The AutoHeuristic test was sometimes failing because it required autotuning to happen for mixed_mm which didn't end up happening when there was a fx graph cache hit.
- The tests inside pattern_matcher failed because in some cases pad_mm decided to pad which made the mixed_mm pattern not match anymore (instead of cast -> mm, it was cast -> pad -> mm), and the tests also fail when is_big_gpu is false (which I haven't found an explanation for).
Differential Revision: [D60972176](https://our.internmc.facebook.com/intern/diff/D60972176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133015
Approved by: https://github.com/Chillee, https://github.com/eellison
Partially fixes #122980
- change cpp type mapping for complex64 to std::complex<float>
- add `aoti_torch_item_complex64` and `aoti_torch_scalar_to_tensor_complex64`.
- add `expensiveCopyToTensor()` to convert `ArrayRefTensor<T>` type to `AtenTensorHandle` type.
- if we want to fully fix#122980, we still need to let ArrayRef and MiniArrayRef to consider underlying storage number of elements. See more details in https://github.com/pytorch/pytorch/pull/132347 (#132347 broke some internal tests, so we need more work before landing it).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132810
Approved by: https://github.com/desertfire
It's possible to construct an NJT with "holes" by specifying both `offsets` and `lengths` metadata. When `nt.clone(memory_format=torch.contiguous_format)` is called on such an NJT, the result should be an NJT without holes.
This PR fixes this in simplistic way using `unbind()`, which isn't really supported in `torch.compile`. The longer term solution involves writing a proper kernel to support this.
NB: Another limitation is that the returned NJT does not have the same ragged structure as the input. While we could manually hack the nested int registry (or update the union find when that lands), this is the first instance where a NJT with holes and an NJT without holes could have the same ragged structure, and getting those to play nicely together requires some fairly involved updates. For now, this PR punts on these updates until we can clean this up.
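A hedged sketch of the scenario (metadata values are illustrative):
```python
import torch
from torch.nested import nested_tensor_from_jagged

values = torch.randn(10, 3)
offsets = torch.tensor([0, 4, 8, 10])
lengths = torch.tensor([3, 2, 2])  # shorter than the offset deltas -> "holes"

nt = nested_tensor_from_jagged(values, offsets=offsets, lengths=lengths)
# With this PR, cloning to contiguous format returns an NJT without holes.
nt_contig = nt.clone(memory_format=torch.contiguous_format)
```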
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132776
Approved by: https://github.com/ani300, https://github.com/soulitzer
ghstack dependencies: #131898, #131704, #131937
Summary: These tests are failing stress tests internally because of remote caching. Most already have local cache disabled; disable remote cache as well
Test Plan: Ran stress tests locally for each of the affected tests
Differential Revision: D60940081
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132955
Approved by: https://github.com/leslie-fang-intel
Summary: When HOPs live out of tree, it makes it impossible to make breaking changes to the HOP API. But HOP implementations are deeply entwined with PyTorch internals. Move the HOP into PyTorch tree so that changes are possible.
Test Plan: sandcastle, ossci
Differential Revision: D60674615
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132525
Approved by: https://github.com/zou3519, https://github.com/Skylion007
The capture_triton decorator returns a function that goes through the
triton kernel wrapper HOP. This is useful for make_fx tracing and
non-strict export. However, the HOP dispatch is slow (~1ms) and not
necessary in certain situations.
This PR skips going through the HOP dispatch for any
capture_triton-wrapped triton kernels that are registered as
implementations to a `@triton_op` custom operator. We do this by
creating a new thread-local flag that controls if the
capture_triton-wrapped triton kernel goes through HOP dispatch or not.
Test Plan:
- new test and existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132822
Approved by: https://github.com/SherlockNoMad
Summary:
Modify `softmax` on the ragged dimension, where `ragged_idx == 1`, to allow for 2D nested tensors. This diff now enables a `softmax` operation on tensors of shape `(B, *)`, where `*` is the ragged dimension.
Extend existing `softmax` unit tests to include 2D nested tensors using the `include_2d_tensor=True` keyword argument.
Test Plan:
Verify that existing and modified unit tests pass using the following commands:
```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_softmax
```
```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_jagged_op
```
Reviewed By: davidberard98
Differential Revision: D60780975
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132812
Approved by: https://github.com/davidberard98
The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead.
I annotated the PR with explanation of changes.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #132674, #132675
Instead of having a separate context variable for SymDispatchMode, we
now simply delegate to the current active proxy tensor mode when we
need to trace a SymInt. We maintain a separate `__sym_dispatch__` magic
method as the calling convention is different than `__torch_dispatch__`.
Consolidating the modes in this ways means that we can consistently
disable both of these modes in tandem simply by removing the mode
from the proxy mode infra slot.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132674
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
Introduces an enhancement for SortingKernel.cpp for cases where both the values and indices tensors have a stride of 1, indicating contiguous memory layouts.
The changes include:
1. A new function `sort_kernel_impl`, encapsulating the core sorting logic for distinct types of tensor accessors.
2. Modifications to the `sort_kernel` function to utilize `sort_kernel_impl`. It now checks for tensor strides and optimally handles contiguous and non-contiguous tensor scenarios.
3. The optimization aims to improve cache locality and efficiency in memory access for contiguous tensor sorts.
4. Enhanced Code Readability and Structure: The restructuring of the sorting process improves clarity and maintenance by clearly defining how different stride scenarios are handled, making the code more transparent and easier to understand.
Tests have been conducted across various tensor sizes and shapes to ensure stability and reliability of the change.
The result of running the `test/test_sort_and_select.py` test suite is consistent between the main branch, and this modified branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132236
Approved by: https://github.com/jgong5
As titled, this PR rewrite the current redistribute algorithm to make
the multi-mesh dim redistribute logic more sound. The previous algorithm
works numerically, but it could incur additional unnecessary steps
when transforming shardings in a multi-dimension device mesh, i.e.
Let's say we want to transform from (S(1), S(1)) -> (S(1), S(2)). The
previous algorithm yield the following steps:
* mesh_dim 1: S(1) -> R, mesh_dim 0: S(1) -> R
* mesh_dim 0: R -> S(1), mesh_dim 1: R -> S(2)
Although it works semantically but it incurs two allgather
transformations, where it should really only incur a S(1) -> S(2) on the
mesh dim 1.
The rewrite algorithm basically take it in a more principled way:
1. we check if src_spec have sharding, if not, we don't need to worry about nested sharding case, as sharding would always be in order, so we just go from left to right in the placements and add the transform steps
2. if src_spec has sharding, this potentially means there could be either nested or mis-aligned shardings. So we first traverse from right to left to check if there's a mis-aligned sharding as the above example showed; if there is, we replicate that mesh dimension so that it unshards the nested sharding
3. we traverse again from left to right to generate the transformation
after we unshard the nested sharding
should also fix https://github.com/pytorch/pytorch/issues/132751
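A hedged sketch of the example transform above using DTensor (meant to run under torchrun with 4 ranks; uses the private `torch.distributed._tensor` namespace):
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import Shard, distribute_tensor

# 2-D device mesh; (S(1), S(1)) -> (S(1), S(2)) should now be a single
# shard-to-shard transform on mesh dim 1 rather than two allgathers.
mesh = init_device_mesh("cuda", (2, 2))
x = torch.randn(8, 8, 8)
dt = distribute_tensor(x, mesh, [Shard(1), Shard(1)])
dt2 = dt.redistribute(mesh, [Shard(1), Shard(2)])
```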
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131210
Approved by: https://github.com/tianyu-l
#### Description
Transform quantized operation properly. Add de/quantization before and after the quantized operation.
#### Test Plan
`pytest test/export/test_converter.py -s -k test_ts2ep_convert_quantized_model`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131915
Approved by: https://github.com/angelayi
Summary:
When fixing https://github.com/pytorch/pytorch/issues/130810, we suspected FSDP1 optimizer state_dict cannot handle foreach optimizer, which is not correct. For FSDP1, whether optimizer uses foreach or not does not matter. Since we already have tests for non-foreach version optimizer, this PR changes the distributed state_dict tests for FSDP1 to use foreach optimizer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132933
Approved by: https://github.com/c-p-i-o
ghstack dependencies: #132908
There are still some differences between CUDA and non-CUDA custom devices when
constructing FSDP because CUDA is selected as the default device. For example,
when construct FSDP from CPU model and device_id is not passed, device_handle
will choose CUDA as default device. This PR will autoselect the real device
as the default device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127609
Approved by: https://github.com/awgu
Summary: Dynamo doesn't trace through sparse tensors in fbcode. So we should disable tests that run sparse tensors in export. We should do this to make the CI green internally.
Test Plan:
Before:
Tests finished: Pass 1409. Fail 71. Fatal 0. Skip 90. Build failure 0
After:
Tests finished: Pass 1408. Fail 0. Fatal 0. Skip 162. Build failure 0
Differential Revision: D60870543
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132824
Approved by: https://github.com/BoyuanFeng
Summary:
**Context:**
Currently we have a helper to print out AtenTensor in [shim_common.cpp](https://github.com/pytorch/pytorch/blob/v2.4.0-rc4/torch/csrc/inductor/aoti_torch/shim_common.cpp#L866)
The way we were using this function was a “manual” process. We inject this function into the generated output.cpp file, and recompile and reload the file. This diff automates the printing value process.
**Changes:**
1. Added a simple initial debug printer helper to print out tensor values
2. Added a filter option to selectively dump tensor values.
**Usage:**
Sample cmd :
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, +schedule, output_code" python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda
```
Sample outputs :
```
[ before_launch - triton_poi_fused_0 - buf0 ]:
0.6331
1.6358
-0.3459
1.0196
-0.4122
1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0
[ after_launch - triton_poi_fused_0 - buf0 ]:
0.6331
1.6358
-0.3459
1.0196
-0.4122
1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0
[ before_launch - aoti_torch_cuda_addmm_out - buf1 ]:
Min value: -2.25655
Max value: 2.32996
Device: cuda:0
Size: [16, 6]
Stride: [6, 1]
Dtype: float
Layout: Strided
Number of elements: 96
Is contiguous: 1
Requires grad: 0
[ before_launch - aoti_torch_cuda_addmm_out - buf0 ]:
0.6331
1.6358
-0.3459
1.0196
-0.4122
1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0
[ after_launch - aoti_torch_cuda_addmm_out - buf1 ]:
Min value: -12.0839
Max value: 11.6878
Device: cuda:0
Size: [16, 6]
Stride: [6, 1]
Dtype: float
Layout: Strided
Number of elements: 96
Is contiguous: 1
Requires grad: 0
[ after_launch - aoti_torch_cuda_addmm_out - buf0 ]:
0.6331
1.6358
-0.3459
1.0196
-0.4122
1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0
stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('extern_calls', 2)]
.
----------------------------------------------------------------------
Ran 1 test in 10.867s
OK
```
The user is able to filter kernel names to print out values by specifying env var `AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT` and see choices of kernel names in a log message like below:
```
torch/_inductor/graph.py:1642] Finished codegen for all nodes. The list of kernel names available: ['triton_poi_fused_0', 'aoti_torch_cuda_addmm_out']
```
In the follow-up diff, we will add `torch.save()` to dump/save the intermediate tensors into individual `.pt` files that can later be loaded back with `torch.load()`.
Test Plan:
Run Unit Tests in OSS: (similar cmd as mentioned above in the usage part)
`AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, output_code" python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda`
Differential Revision: D60538496
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132323
Approved by: https://github.com/ColinPeppler
When autocasting is turned on, right now SDPA w/ NJT won't be autocasted. This PR adds manual "autocasting" logic in sdpa.py - at the beginning, it just checks if autocasting is enabled, and if so, it casts the inputs in the way you would expect if autocasting was actually running.
Why normal autocasting won't work:
* NJT intercepts the `__torch_function__` call for scaled_dot_product_attention, which, AFAIK, happens before we get to any dispatcher logic, and then calls efficient attention or flash attention. So autocasting the scaled_dot_product_attention op won't work; we never call the aten op for scaled_dot_product_attention, so we won't ever run autocasting for it.
* If we try to add autocasting handling for `_flash_attention_forward` or `_efficient_attention_forward`, then autocasting will _run_, but it will have the wrong semantics: sdpa.py's handling will run first, and it will do backend selection based on the uncasted inputs to SDPA. This also means that if the inputs to the SDPA call don't have uniform types, the sdpa.py implementation will fail checks (this is the specific issue we're targeting).
Alternative: "just change the backend selection logic for NJT to be autocast aware, but don't actually do the autocast; then, add `_(flash|efficient)_attention_forward` to autocasting rules". I think this would work too. But it's arguably better to make the backend-selection logic and actual-autocast-behavior use the same implementation, in case the implementations are different.
Differential Revision: [D60879916](https://our.internmc.facebook.com/intern/diff/D60879916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132835
Approved by: https://github.com/soulitzer
Summary:
A re-land of D60006710.
Fixed TrainingIRToRunDecomp failures for test_tensor_attribute_zero_args and also a few re-tracability failures because run_decomposition does a retracing.
edit: also remove the eliminate_dead_code() in _unlift because of one onnx test failure:
a constant tensor attr was lifted as a constant_tensor input but is not used in the graph after aot_autograd due to a shortcut in its decomposition. This causes the setattr to be removed by eliminate_dead_code, but the graph signature still contains the name of that buffer, which causes an inconsistency between the transformed graph and the ep's original signature after _unlift. It seems this has happened a few times, where some nodes are accidentally removed and we end up in an inconsistent state.
The alternative to removing it would be: every time we call eliminate_dead_code, we verify the consistency of the graph with 1. the graph before transformation and 2. all the metadata, but I think this deserves a complete design.
edit 2: Also fix the inconsistency of graph signatures when param_constant is marked as lifted_tensor_constants but it's registered as parameters in the output of ep.module().
Differential Revision: D60532628
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132307
Approved by: https://github.com/zhxchen17
move benchmarking out of `torch._inductor.runtime.runtime_utils` and into `torch._inductor.runtime.benchmarking`, and prefer this path over directly accessing Triton's benchmarking
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132827
Approved by: https://github.com/eellison
Debugged with @leslie-fang-intel, and we found that https://github.com/pytorch/pytorch/issues/132561 and https://github.com/pytorch/pytorch/issues/132569 both fail because `capture_pre_autograd_graph` does not work well on Windows.
So we added some code to raise a message and let end users know that.
Detailed:
For https://github.com/pytorch/pytorch/issues/132561
```cmd
Traceback (most recent call last):
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 59, in testPartExecutor
yield
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 591, in run
self._callTestMethod(testMethod)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 549, in _callTestMethod
method()
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 2918, in wrapper
method(*args, **kwargs)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 1515, in wrapper
fn(*args, **kwargs)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_quantization.py", line 399, in wrapper
fn(*args, **kwargs)
File "D:\xu_git\dnnl_cb\pytorch\test\quantization\pt2e\test_x86inductor_quantizer.py", line 1737, in test_qat_conv2d
self._test_quantizer(
File "D:\xu_git\dnnl_cb\pytorch\test\quantization\pt2e\test_x86inductor_quantizer.py", line 553, in _test_quantizer
m = capture_pre_autograd_graph(
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_export\__init__.py", line 121, in capture_pre_autograd_graph
raise RuntimeError("capture_pre_autograd_graph not yet supported on Windows")
RuntimeError: capture_pre_autograd_graph not yet supported on Windows
To execute this test, run the following from the base repo dir:
python test\quantization\pt2e\test_x86inductor_quantizer.py -k TestQuantizePT2EX86Inductor.test_qat_conv2d
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
For https://github.com/pytorch/pytorch/issues/132569
```cmd
Traceback (most recent call last):
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 59, in testPartExecutor
yield
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 591, in run
self._callTestMethod(testMethod)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 549, in _callTestMethod
method()
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 2918, in wrapper
method(*args, **kwargs)
File "D:\xu_git\dnnl_cb\pytorch\test\inductor\test_torchinductor.py", line 11218, in new_test
return value(self)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_dynamo\testing.py", line 312, in _fn
return fn(*args, **kwargs)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\contextlib.py", line 79, in inner
return func(*args, **kwds)
File "D:\xu_git\dnnl_cb\pytorch\test\inductor\test_cpu_cpp_wrapper.py", line 155, in fn
_, code = test_torchinductor.run_and_get_cpp_code(
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_inductor\utils.py", line 1863, in run_and_get_cpp_code
result = fn(*args, **kwargs)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_quantization.py", line 415, in wrapper
fn(*args, **kwargs)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_quantization.py", line 367, in wrapper
fn(*args, **kwargs)
File "D:\xu_git\dnnl_cb\pytorch\test\inductor\test_mkldnn_pattern_matcher.py", line 1668, in test_qlinear_gelu_cpu
self._qlinear_unary_cpu_test_helper((torch.randn((2, 4)),), gelu)
File "D:\xu_git\dnnl_cb\pytorch\test\inductor\test_mkldnn_pattern_matcher.py", line 1615, in _qlinear_unary_cpu_test_helper
self._test_common(
File "D:\xu_git\dnnl_cb\pytorch\test\inductor\test_mkldnn_pattern_matcher.py", line 165, in _test_common
convert_model = _generate_qdq_quantized_model(
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_quantization.py", line 2949, in _generate_qdq_quantized_model
export_model = capture_pre_autograd_graph(
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_export\__init__.py", line 121, in capture_pre_autograd_graph
raise RuntimeError("capture_pre_autograd_graph not yet supported on Windows")
RuntimeError: capture_pre_autograd_graph not yet supported on Windows
To execute this test, run the following from the base repo dir:
python test\inductor\test_cpu_cpp_wrapper.py -k DynamicShapesCppWrapperCpuTests.test_qlinear_gelu_cpu_dynamic_shapes_cpp_wrapper
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
--------------------------------------------------------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------------------------------------------------------
W0807 13:24:34.291000 11228 torch\_export\__init__.py:64] +============================+
W0807 13:24:34.291000 11228 torch\_export\__init__.py:65] | !!! WARNING !!! |
W0807 13:24:34.291000 11228 torch\_export\__init__.py:66] +============================+
W0807 13:24:34.291000 11228 torch\_export\__init__.py:67] capture_pre_autograd_graph() is deprecated and doesn't provide any function guarantee moving forward.
W0807 13:24:34.291000 11228 torch\_export\__init__.py:68] Please switch to use torch.export instead.
```
Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132841
Approved by: https://github.com/jgong5, https://github.com/ezyang
See title. Until now, calling `torch.as_tensor` on a CuPy array would return a CPU tensor, when not providing a device. This is most likely not desired.
Fixes #132553
```python3
import torch
import cupy as cp
cupy_arr = cp.asarray([1, 2, 3])
# Default case
t = torch.as_tensor(cupy_arr)
# New behavior, same device as cupy_arr now, was cpu before
print(t.device) # cuda:0
# Explicitly set device
t = torch.as_tensor(cupy_arr, device='cpu')
print(t.device) # cpu
# Implicit default device
torch.set_default_device('cpu')
t = torch.as_tensor(cupy_arr)
print(t.device) # cpu
# Default device via context manager
torch.set_default_device('cuda')
with torch.device('cpu'):
    t = torch.as_tensor(cupy_arr)
print(t.device) # cpu
# Unset default device
torch.set_default_device(None)
t = torch.as_tensor(cupy_arr)
# New behavior, same device as cupy_arr now, was cpu before
print(t.device) # cuda:0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132595
Approved by: https://github.com/ezyang
Summary:
By default, performance tests (speedup experiments) will run the baseline and test backend alternately.
However, this does not work for the torchao backend, which will change the model in-place, therefore the baseline run will also run with torchao backend since the model has already been quantized.
Add a new experiment "latency_experiment" to run performance tests non-alternately (first run baseline for a few iterations, then run the test backend).
Other changes:
- need to add torch.compiler.cudagraph_mark_step_begin() to avoid the slowdown from "Unable to hit fast path of CUDAGraphs because of pending, uninvoked backwards"
- also updated the torchao APIs to the current versions
X-link: https://github.com/pytorch/benchmark/pull/2394
Test Plan:
python run_benchmark.py torchao --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
python run_benchmark.py torchao --only BartForCausalLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
python run_benchmark.py torchao --only timm_efficientnet --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
(should all be ~1.0x)
0.997x
1.006x
0.994x
Reviewed By: xuzhao9
Differential Revision: D60252821
Pulled By: HDCharles
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131935
Approved by: https://github.com/xuzhao9
Creates a new runtime that shifts complexity from runtime to
ahead-of-time.
The existing runtime (PipelineScheduleMulti) accepts a compute-only schedule, in which only compute actions (forward, backward, weight) are specified, and it infers the communication operations at runtime.
Compared to that runtime, PipelineScheduleRuntime has less logic that
happens at runtime and relies on lowering passes to transform the
compute-only schedule to add communications.
Advantages include
- easier to verify the correctness by dumping a compute+comm schedule
- possible to manually edit the compute+comm schedule if the lowering heuristics are insufficient
Functionality included inside the PipelineScheduleRuntime is limited to
- accepting a compute-only schedule and lowering it to add comms
- executing the compute or comm operations specified by the given
schedule
- handling work.wait() automatically by calling it just before the
matching compute operation (for RECV ops) or at the end of step (for
SEND ops)
Follow ups for later PRs
- Some refactoring should be done to replace PipelineScheduleMulti with
this runtime
- Optimizer execution is not considered (e.g. for zero-bubble cases)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130488
Approved by: https://github.com/H-Huang
Summary:
PDB allows conditional breakpoints, but that ability doesn't work in a distributed environment. We can still implement a conditional breakpoint by doing something like the following:
```
counter = 0

def run_step():  # hypothetical function in the code being debugged
    global counter
    counter += 1
    if counter > 100:
        dist.breakpoint()
```
This PR makes dist.breakpoint() support this feature as syntactic sugar.
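For reference, a usage sketch of the sugar; treat the exact keyword (`skip`) as an assumption based on the description above rather than a documented contract:
```python
import torch.distributed as dist

# Assumes the process group is already initialized. Same intent as the manual
# counter above: only actually drop into pdb on rank 0 once this call site has
# been hit more than 100 times.
dist.breakpoint(rank=0, skip=100)
```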
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129511
Approved by: https://github.com/wconstab, https://github.com/c-p-i-o
Summary:
A couple of improvements to the generated comments in inductor kernels:
1. Makes the nodes in the comment topologically sorted, I think having them
alphabetically sorted is a gotcha. I was always confused on why the
sorting in the comments did not match the code.
2. Adds a printout of the aten graph fragment corresponding to the
current inductor kernel, to make it easier to map from aten
code to inductor code
Example float8-overhead-related inductor kernel comment after this PR:
```
# kernel path: /tmp/torchinductor_vasiliy/27/c27ts3rdw56ns7od5j6ovdnhxphished2lcu3adclzzixoo7khg5.py
# Source Nodes: [weight_fp8], Original ATen: [aten.mul, aten.clamp, aten._to_copy]
# Source node to ATen node mapping:
# weight_fp8 => clamp_max_1, clamp_min_3, convert_element_type_10, convert_element_type_11, convert_element_type_9, mul_3
# Graph fragment:
# %mul_3 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%primals_2, %convert_element_type_8), kwargs = {})
# %convert_element_type_9 : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%mul_3, torch.float32), kwargs = {})
# %clamp_min_3 : [num_users=1] = call_function[target=torch.ops.aten.clamp_min.default](args = (%convert_element_type_9, -448.0), kwargs = {})
# %clamp_max_1 : [num_users=1] = call_function[target=torch.ops.aten.clamp_max.default](args = (%clamp_min_3, 448.0), kwargs = {})
# %convert_element_type_10 : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%clamp_max_1, torch.bfloat16), kwargs = {})
# %convert_element_type_11 : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%convert_element_type_10, torch.float8_e4m3fn), kwargs = {})
triton_poi_fused__to_copy_clamp_mul_5 = async_compile.triton('triton_', '''
```
Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126698
Approved by: https://github.com/ezyang
ghstack dependencies: #126573
Summary:
Uses the `seq_nr` field (introduced to aot_autograd nodes in
https://github.com/pytorch/pytorch/pull/103129) to map the aot_autograd
fx bw nodes to the corresponding fw nodes, and copy the metadata over.
I am trusting the `seq_nr` mapping in the linked PR here. I did
some validation with a toy LLaMa 3 8b training run and the mapping seemed
correct.
I am also trusting that the forward is single threaded, since `seq_nr` is thread local. If this isn't always true, we'll need to also plumb `thread_id` through the same machinery which is populating `seq_nr`.
I'd like to use this data in a future PR to make inductor kernels easily
attributable to the nn.Module path in modeling land, to make it easier
to do performance debugging.
Test Plan:
```
// 1. unit test
python test/dynamo/test_aot_autograd.py -k test_aot_sequence_nr
// 2. manual test
// run LLaMa 3 8B fw + bw with torch.compile, print out the inductor graphs
// seen in `torch/_inductor/utils.py::get_kernel_metadata`, they seemed
// right to me.
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126573
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
I find myself occasionally trying to modify this to get additional debug info. Recompiling takes forever after modifying these lines, because the .h file is depended on by a huge number of files.
If we move this logic into a helper function and put it in the .cpp file, recompilation will be a lot faster when adding debug here.
Tested with a local DEBUG=1 build (which is needed to use `TORCH_SHOW_DISPATCH_TRACE=1`) and verified basic sanity - i.e. it still prints `[call]`, etc.
Differential Revision: [D60804331](https://our.internmc.facebook.com/intern/diff/D60804331)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132717
Approved by: https://github.com/soulitzer, https://github.com/bdhirsh
Overloads so that you can get more specific type info based on how you are indexing.
```python
from torch import nn
module_list = nn.ModuleList(32 * [nn.Linear(2, 2)])
# before:
reveal_type(module_list[0]) # Type of "module_list[0]" is "Module | ModuleList"
reveal_type(module_list[:1]) # Type of "module_list[: 1]" is "Module | ModuleList"
# now:
reveal_type(module_list[0]) # Type of "module_list[0]" is "Module"
reveal_type(module_list[:1]) # Type of "module_list[: 1]" is "ModuleList"
```
Co-authored-by: Skylion007 <Skylion007@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132834
Approved by: https://github.com/Skylion007, https://github.com/albanD
Move the slow test json into the pytorch/pytorch repo and add a job that will update it weekly. The job uses the same environment as the commit hash update. It uses similar code to the hash update, but the hash update contains a lot of code that is specific to it, so I chose to pick out only the relevant parts.
Remove references to the old file and set up testing to read from the new file instead.
The old update cadence was every day; the new one is every week.
The auto slow test infra + the lack of pinning between pytorch and test-infra makes it really hard to tell if a test started failing because of a change or because of the slow test json changing. While this can have benefits, like disable test issues being effective everywhere immediately, it can also be very confusing, especially since we don't have the same insight into slow tests like we do for disable issues.
Example PR made: https://github.com/pytorch/pytorch/pull/132383 (with all the changes from this PR because it was working on top of this)
We should just get rid of this at some point in favor of the slowTest decorator, but there are some tests that take 5+ minutes to run and I don't want to track them down right now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132379
Approved by: https://github.com/huydhn
As XPU became a PyTorch built-in device, the profiler support is indispensable part of functionality completeness. This PR is associated with the PR to introduce XPU profiler plugin into the kineto. When USE_XPU is enabled, the LIBKINETO_NOXPUPTI option will be suppressed accordingly, which allows kineto to build with XPU profiler plugin.
Associated PR to introduce kineto-based XPU profiler into kineto:
https://github.com/pytorch/kineto/pull/961
Also updates the Kineto Submodule to include XPU changes.
Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130811
Approved by: https://github.com/aaronenyeshi
**Summary**
1. change `compute_local_shape_and_global_offset` to correctly compute shape and offset for strided sharding placement (currently it only handles 2D and some 3D+ sharding).
2. Add a new property `num_shards_map` to `DTensorSpec` denoting how many shards each tensor dimension has. This is necessary for constructing `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])` and the `split_factor` argument will just be the number of shards on that sharding tensor dim.
**Test**
`test/distributed/_tensor/test_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132391
Approved by: https://github.com/wanchaol
ghstack dependencies: #126697, #130239
**Summary**
This PR adds a new private placement type `_StridedShard` for FSDP2 + TP style tensor sharding. The previously used `Shard` placement type cannot produce correct `full_tensor()` result because it assumes the tensor to be first sharded over `dp` mesh dimension then `tp` mesh dimension which does not hold true in FSDP2 + TP case.
**Test**
`pytest test/distributed/_tensor/test_utils.py -s -k strided_sharding`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126697
Approved by: https://github.com/wanchaol
This tries to fix https://github.com/pytorch/pytorch/issues/120961.
This is a similar situation as https://github.com/pytorch/pytorch/pull/132116. The overlap tests were written strictly based on a precise calculation of what compute/communication should be non-overlapped vs. overlapped. This is done via `torch.cuda._sleep()`, which takes inputs in cycles, so we must convert from milliseconds to cycles via `get_cycles_per_ms()`, which is computed once and cached. Variation in CI can cause this `get_cycles_per_ms()` value to be inaccurate when the FSDP overlap tests run. Thus, we decide to relax the overlap tests to just make sure the overlapped runs are faster than a baseline without overlap.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132869
Approved by: https://github.com/weifengpy
More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366.
TLDR:
When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if default pg is gloo, as lots of collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` would result in a mysterious SIGTERM when all_gather a cuda tensor using gloo. Without the change in this PR, users would have to know the context and explicitly move the cuda tensor to cpu before invoking most collectives, which I think is not so ideal UX.
Therefore, given most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709
Approved by: https://github.com/awgu, https://github.com/wanchaol
Fixes #10536
Reattempt of #61467. Thank you so much to @mskoh52 for your excellent work!
As I was trying to create a more efficient LLM data collator, I realized that `pad_sequence` only supports right padding, even though left padding is a very common format for LLMs, like Llama and Mistral.
The proposed alternative implementation was to use multiple flips, which tends to be 1.5x-2x slower. Instead we can add a [`padding_side` parameter as there is for Hugging Face tokenizers](9d6c0641c4/src/transformers/tokenization_utils_base.py (L1565)), which requires only a very small change in the C++ code.
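A small usage example of the new keyword (values are illustrative):
```python
import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([1, 2, 3]), torch.tensor([4])]
# "right" remains the default; "left" pads at the front, as LLM collators expect.
print(pad_sequence(seqs, batch_first=True, padding_value=0, padding_side="left"))
# tensor([[1, 2, 3],
#         [0, 0, 4]])
```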
Here are the benchmarks of the new implementation!
`float32`:

`bool`:

Code:
```python
from __future__ import annotations
import random
import time
from typing import Literal
import numpy as np
import torch
def pad_sequence_with_flips(
    sequences: list[torch.Tensor],
    batch_first: bool = False,
    padding_value: int | float | bool = 0.0,
    padding_side: Literal["left", "right"] | str = "left",
) -> torch.Tensor:
    if padding_side == 'right':
        padded_sequence = torch._C._nn.pad_sequence([t.flatten() for t in sequences], batch_first=batch_first, padding_value=padding_value)
    elif padding_side == 'left':
        padded_sequence = torch._C._nn.pad_sequence([t.flatten().flip(0) for t in sequences], batch_first=batch_first, padding_value=padding_value)  # pyright: ignore[reportArgumentType]
        padded_sequence = padded_sequence.flip(int(batch_first))
    else:
        raise ValueError(f"padding_side should be either 'right' or 'left', but got {padding_side}")
    return padded_sequence
sequence_lengths: list[int] = []
flip_left_pad_times: list[float] = []
flip_left_pad_times_std: list[float] = []
left_pad_times: list[float] = []
left_pad_times_std: list[float] = []
RUNS_PER_LOOP: int = 100
for i in range(1, 7):
    sequence_length = i * int(1e6) // 6
    sequence_lengths.append(sequence_length)
    sequences = [torch.randint(0, 2, (random.randint(1, sequence_length),), dtype=torch.bool) for _ in range(64)]
    inner_left_pad_times: list[float] = []
    inner_right_pad_times: list[float] = []
    inner_flip_left_pad_times: list[float] = []
    inner_flip_right_pad_times: list[float] = []
    for _ in range(RUNS_PER_LOOP):
        start = time.perf_counter()
        torch._C._nn.pad_sequence(sequences, batch_first=True, padding_value=False, padding_side="left")
        end = time.perf_counter()
        inner_left_pad_times.append(end - start)
        start = time.perf_counter()
        pad_sequence_with_flips(sequences, batch_first=True, padding_value=False, padding_side="left")
        end = time.perf_counter()
        inner_flip_left_pad_times.append(end - start)
    left_pad_times.append(sum(inner_left_pad_times) / len(inner_left_pad_times))
    left_pad_times_std.append(np.std(inner_left_pad_times))
    flip_left_pad_times.append(sum(inner_flip_left_pad_times) / len(inner_flip_left_pad_times))
    flip_left_pad_times_std.append(np.std(inner_flip_left_pad_times))
    print(f"Sequence Length: {sequence_length}, Left Pad Time: {left_pad_times[-1]}, Left with Flips Pad Time: {flip_left_pad_times[-1]}")
import matplotlib.pyplot as plt
plt.plot(sequence_lengths, left_pad_times, label="new pad_sequence left")
plt.scatter(sequence_lengths, left_pad_times)
plt.errorbar(sequence_lengths, left_pad_times, yerr=left_pad_times_std, linestyle='None', marker='^')
plt.plot(sequence_lengths, flip_left_pad_times, label="old pad_sequence left (2 flips)")
plt.scatter(sequence_lengths, flip_left_pad_times)
plt.errorbar(sequence_lengths, flip_left_pad_times, yerr=flip_left_pad_times_std, linestyle='None', marker='^')
plt.xlabel("Sequence Length")
plt.ylabel("Time (s)")
plt.legend(loc="upper right")
# Sequence Length: 166666, Left Pad Time: 0.06147645162009212, Left with Flips Pad Time: 0.09842291727001794
# Sequence Length: 333333, Left Pad Time: 0.08933195920990329, Left with Flips Pad Time: 0.15597836187991562
# Sequence Length: 500000, Left Pad Time: 0.08863158334006585, Left with Flips Pad Time: 0.15224887342999863
# Sequence Length: 666666, Left Pad Time: 0.10524682551997103, Left with Flips Pad Time: 0.18177212480995877
# Sequence Length: 833333, Left Pad Time: 0.11801802741003485, Left with Flips Pad Time: 0.20821274195001024
# Sequence Length: 1000000, Left Pad Time: 0.131894061660023, Left with Flips Pad Time: 0.23223503091008751
```
Co-authored-by: mskoh52 <mskoh52@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131884
Approved by: https://github.com/ezyang
This PR does 3 things:
1. Adds a copy-free strided->jagged layout conversion for NT
2. Adds a copy-free jagged->strided layout conversion for NT
3. Modifies and expands the .to() API to support the layout argument for the specific case of NT layout conversion.
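A hedged usage sketch of point 3, based only on the description above (the exact call pattern is an assumption):
```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 4), torch.randn(3, 4)], layout=torch.strided
)
# Convert between NT layouts via .to(layout=...); per the description above this
# is copy-free for the strided<->jagged cases handled by this PR.
nt_jagged = nt.to(layout=torch.jagged)
print(nt_jagged.layout)  # torch.jagged
```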
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115749
Approved by: https://github.com/jbschlosser
Fixes #132290
This PR attempts a more invasive / complete solution than the one from #132338, which removes immediate tensor fields from the `tensor_dict` copy stored in node meta. The approach taken here is to store only those fields of the `tensor_dict` which are absolutely utilized somewhere else.
So far, this appears to be limited to:
* `_dynamo_static_input_type`
* `tag` (at least in the tests). Discussion at #94080 appears to indicate this is depended on for export
(CI may point out more)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132805
Approved by: https://github.com/mlazos
The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead.
I annotated the PR with explanation of changes.
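A hedged illustration of the failure mode (not a test from this PR): summing many SymInts in traced user code used to build a long chain of deferred thunks that could overflow the stack when finally evaluated; with this change, those operations are placed into the graph eagerly.
```python
import torch

@torch.compile(dynamic=True)
def total_rows(tensors):
    total = 0
    for t in tensors:
        # Each addition produces a new SymInt in traced user code.
        total = total + t.size(0)
    return total

xs = [torch.randn(i + 1, 2) for i in range(100)]
print(total_rows(xs))  # 5050
```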
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #132674, #132675
Previously, when we slice out a submesh from a mesh, we assign the mesh as the parent mesh of the submesh. In this case, when we have a 3D mesh topology, the parent mesh of a 1D mesh sliced out from the 3D mesh is different from the parent mesh of the same 1D mesh sliced out from the 2D submesh of the 3D mesh. For example:
```
mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]
mesh_2d = mesh_3d["dim0", "dim1"]
mesh_dim0_2 = mesh_2d["dim0"]
# This would evaluate to True
print(_mesh_resources.get_parent_mesh(mesh_dim0) != _mesh_resources.get_parent_mesh(mesh_dim0_2))
```
We can always reconstruct the mesh needed from the mesh dim names, as long as two dims come from the same root. For simplicity, we do not see the necessity of building a tree structure to represent child-parent relationship. Therefore, we are replacing the parent mesh concept with a root mesh concept in `_MeshEnv` so we would have:
```
mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]
mesh_2d = mesh_3d["dim0", "dim1"]
mesh_dim0_2 = mesh_2d["dim0"]
# This would evaluate to True
print(_mesh_resources.get_root_mesh(mesh_dim0) == _mesh_resources.get_root_mesh(mesh_dim0_2))
```
With this change, we will have two types of meshes in an environment.
1. `device_mesh != _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is created by slicing.
2. `device_mesh == _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is a root mesh not created through slicing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132339
Approved by: https://github.com/wanchaol
ghstack dependencies: #132310, #132311
Summary: We observe that the stack node can be transformed to a cat node to eliminate split nodes, which could further enable the unbind-cat optimization, thus we add a more advanced pattern to do the graph transformation.
Test Plan:
# unit test
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```
Buck UI: https://www.internalfb.com/buck2/de6c1cda-3d74-4a30-8980-7b209b6fe5dc
Test UI: https://www.internalfb.com/intern/testinfra/testrun/12103424042268125
Network: Up: 485KiB Down: 728KiB (reSessionID-2f2c01c3-79bb-4e37-b5be-fb77ec09b264)
Jobs completed: 29. Time elapsed: 5:19.8s.
Cache hits: 0%. Commands: 4 (cached: 0, remote: 0, local: 4)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0
# benchmark
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697
```
P1503698962
before and after graph transformation
https://www.internalfb.com/intern/diffing/?paste_number=1504050718
Differential Revision: D60411560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132542
Approved by: https://github.com/jackiexu1992
Summary:
- We add Inductor logs for what tensors we tried to reinplace, what
tensors we were unable to reinplace, and of those tensors, which of
those might be bugs (the "missed reinplacing opportunities"). You can
tell this by reading the Inductor output graph but the logs make it
easier to figure out.
- Add a dynamo_compile counter for missed reinplacing opportunities. The
goal is to see how widespread existing problems (if any) are. We've had
trouble getting all of the edge cases for the reinplacing pass; the
counter will help us hunt down issues.
Test Plan:
- tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132758
Approved by: https://github.com/eellison
Summary:
- make default DCE pass check schema,
- need to rebase onto https://github.com/pytorch/pytorch/pull/131651 after it's in phabricator (for now the change is manually added).
- mark Proxy dump as NotImplemented for better error msg
- Remove Proxy from tensors when dumping models, as Proxy cannot be dumped.
More details in https://docs.google.com/document/d/1G5vmTXjzxoyVGRI2kpA1gQukK_Glyg2NrE0Oh6Nlg9A/edit?usp=sharing.
Test Plan:
CI
```
- buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r qat_conv2d
- test_export.py
- buck2 run 'fbcode//mode/dev-nosan' fbcode//modai/test:test_modai -- -r test_qat_stinson_htp_export
- buck2 run 'fbcode//mode/dev-nosan' fbcode//vizard_projects/ml_depth/tests:test_model -- -r test_qat_model_et
- buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r dce
- buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=False,use_3d_input=False
- buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=True,use_3d_input=False
- buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_fold_bn_erases_bn_node
```
Reviewed By: angelayi
Differential Revision: D60319175
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132764
Approved by: https://github.com/angelayi
This PR makes sure all current tests in the sparsity export test suite pass. Note that there will probably be anecdotal cases that need fixing after this, but the general idea of preserving sparsity metadata has been completed.
Fixes: https://github.com/pytorch/pytorch/issues/117188
```
$ PYTORCH_TEST_WITH_DYNAMO=0 python test/export/test_sparse.py ........................................................................................................................................................
----------------------------------------------------------------------
Ran 152 tests
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132690
Approved by: https://github.com/ezyang
Bumps [rexml](https://github.com/ruby/rexml) from 3.2.8 to 3.3.3.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/ruby/rexml/releases">rexml's releases</a>.</em></p>
<blockquote>
<h2>REXML 3.3.3 - 2024-08-01</h2>
<h3>Improvements</h3>
<ul>
<li>
<p>Added support for detecting invalid XML that has unsupported
content before root element</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/184">GH-184</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Added support for <code>REXML::Security.entity_expansion_limit=</code> and
<code>REXML::Security.entity_expansion_text_limit=</code> in SAX2 and pull
parsers</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/187">GH-187</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Added more tests for invalid XMLs.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/183">GH-183</a></li>
<li>Patch by Watson.</li>
</ul>
</li>
<li>
<p>Added more performance tests.</p>
<ul>
<li>Patch by Watson.</li>
</ul>
</li>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/186">GH-186</a></li>
<li>Patch by tomoya ishida.</li>
</ul>
</li>
</ul>
<h3>Thanks</h3>
<ul>
<li>
<p>NAITOH Jun</p>
</li>
<li>
<p>Watson</p>
</li>
<li>
<p>tomoya ishida</p>
</li>
</ul>
<h2>REXML 3.3.2 - 2024-07-16</h2>
<h3>Improvements</h3>
<ul>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/160">GH-160</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/169">GH-169</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/170">GH-170</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/171">GH-171</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/172">GH-172</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/173">GH-173</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/174">GH-174</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/175">GH-175</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/176">GH-176</a></li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/ruby/rexml/blob/master/NEWS.md">rexml's changelog</a>.</em></p>
<blockquote>
<h2>3.3.3 - 2024-08-01 {#version-3-3-3}</h2>
<h3>Improvements</h3>
<ul>
<li>
<p>Added support for detecting invalid XML that has unsupported
content before root element</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/184">GH-184</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Added support for <code>REXML::Security.entity_expansion_limit=</code> and
<code>REXML::Security.entity_expansion_text_limit=</code> in SAX2 and pull
parsers</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/187">GH-187</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Added more tests for invalid XMLs.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/183">GH-183</a></li>
<li>Patch by Watson.</li>
</ul>
</li>
<li>
<p>Added more performance tests.</p>
<ul>
<li>Patch by Watson.</li>
</ul>
</li>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/186">GH-186</a></li>
<li>Patch by tomoya ishida.</li>
</ul>
</li>
</ul>
<h3>Thanks</h3>
<ul>
<li>
<p>NAITOH Jun</p>
</li>
<li>
<p>Watson</p>
</li>
<li>
<p>tomoya ishida</p>
</li>
</ul>
<h2>3.3.2 - 2024-07-16 {#version-3-3-2}</h2>
<h3>Improvements</h3>
<ul>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/160">GH-160</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/169">GH-169</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/170">GH-170</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/171">GH-171</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/172">GH-172</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/173">GH-173</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/174">GH-174</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/175">GH-175</a></li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="e4a067e112"><code>e4a067e</code></a> Add 3.3.3 entry</li>
<li><a href="17ff3e7874"><code>17ff3e7</code></a> test: add a performance test for attribute list declaration</li>
<li><a href="be86b3de0a"><code>be86b3d</code></a> test: fix wrong test name</li>
<li><a href="b93d790b36"><code>b93d790</code></a> test: use double quote for string literal</li>
<li><a href="0fbe7d5a0e"><code>0fbe7d5</code></a> test: don't use abbreviated name</li>
<li><a href="1599e8785f"><code>1599e87</code></a> test: add a performance test for PI with many tabs</li>
<li><a href="e2546e6eca"><code>e2546e6</code></a> parse pi: improve invalid case detection</li>
<li><a href="73661ef281"><code>73661ef</code></a> test: fix a typo</li>
<li><a href="850488abf2"><code>850488a</code></a> test: use double quote for string literal</li>
<li><a href="46c6397d5c"><code>46c6397</code></a> test: add performance tests for entity declaration</li>
<li>Additional commits viewable in <a href="https://github.com/ruby/rexml/compare/v3.2.8...v3.3.3">compare view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/pytorch/pytorch/network/alerts).
</details>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132469
Approved by: https://github.com/ezyang
Summary:
## Why
utils.checkpoint doesn't support meta device:
```
File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 490, in checkpoint
next(gen)
File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 1359, in _checkpoint_without_reentrant_generator
device_module = _get_device_module(device)
File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 98, in _get_device_module
device_module = getattr(torch, device)
File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/__init__.py", line 1938, in __getattr__
raise AttributeError(f"module '{__name__}' has no attribute '{name}'")
AttributeError: module 'torch' has no attribute 'meta'
```
This blocks us from running models with checkpointing enabled in meta mode.
## What
This diff handles the case of meta device in checkpoint.py.
(In checkpoint.py, the device module is mainly used when preserve_rng_state=True, which doesn't apply to the meta case. So a more elegant fix might be to set preserve_rng_state=False when detecting that the args are on a meta device. But I didn't find where to do this check in a minimal way. Let me know if you have ideas.)
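A hedged example of the scenario this unblocks (illustrative only, not a test from this diff):
```python
import torch
from torch.utils.checkpoint import checkpoint

with torch.device("meta"):
    lin = torch.nn.Linear(8, 8)
    x = torch.randn(2, 8, requires_grad=True)

# Previously this path hit `AttributeError: module 'torch' has no attribute 'meta'`
# inside _get_device_module; with the meta-device handling it runs through.
out = checkpoint(lambda t: torch.relu(lin(t)), x, use_reentrant=False)
print(out.shape)  # torch.Size([2, 8])
```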
Test Plan: Tested with toy model which has checkpoint on its module: P1513716944
Differential Revision: D60749427
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132684
Approved by: https://github.com/kit1980
This extends the runner determinator to be able to opt in to keywords
that provide additional options when determining which systems to run
jobs on. This enables us to support opting users in to Amazon Linux 2023.
This change creates a generic get_optin_feature() which hopefully will
be useful to handle additional future features that we might want to
experiment with.
This change has kept backwards compatibility with the existing issue
userlist format and adds support for the comma-separated list of users
in a backwards compatible way.
The user list has the following rules:
- Users are GitHub usernames with the @ prefix
- If the first line is a "*" then all users will use the new runners
- If the first line is a "!" then all users will use the old runners
- Each user is also a comma-separated list of features/experiments to enable
- A "#" prefix indicates the user is opted out of the new runners but is opting
into features/experiments.
Example user list:
```
@User1
@User2,amz2023
#@UserOptOutOfNewRunner,amz2023
```
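A minimal sketch of parsing one such entry (an assumed helper for illustration, not the actual get_optin_feature() implementation; the special first-line markers "*" and "!" are handled separately):
```python
def parse_user_line(line: str):
    """Return (username, opted_out_of_new_runners, features) for one list entry."""
    entry = line.strip()
    opted_out = entry.startswith("#")
    if opted_out:
        entry = entry[1:]
    username, *features = entry.split(",")
    return username.strip(), opted_out, {f.strip() for f in features if f.strip()}

print(parse_user_line("#@UserOptOutOfNewRunner,amz2023"))
# ('@UserOptOutOfNewRunner', True, {'amz2023'})
```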
This closes pytorch/ci-infra#249.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131792
Approved by: https://github.com/jeanschmidt, https://github.com/ZainRizvi
The regression from https://github.com/pytorch/pytorch/issues/132281 pinpoints e4ace1a396 as the cause. The main delta that commit introduces is that we now manually check `is_inference()` and call `increment_version()` (a pybind call) on every mutated input tensor to the graph.
This PR attempts to reduce overhead a bit by bundling up all of those checks into a single pybind call, by:
(1) updating `torch.autograd.graph.increment_version()` to accept a `Union[Tensor, List[Tensor]]`
(2) updating its semantics to no-op if you pass in a tensor with no version counter, instead of erroring
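A short usage sketch of the updated API, assuming the list-accepting form described in (1) and (2):
```python
import torch
from torch.autograd.graph import increment_version

a = torch.randn(3)
b = torch.randn(3)
with torch.inference_mode():
    c = torch.randn(3)  # inference tensor: has no version counter

# One pybind call for the whole batch; the inference tensor is silently
# skipped instead of raising.
increment_version([a, b, c])
print(a._version, b._version)  # both bumped to 1
```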
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132652
Approved by: https://github.com/albanD
Summary:
Fix exportdb test for tensor_setattr.
copy.deepcopy can fail if tensor inputs have attributes (i.e. a non-empty __dict__).
We remove them before deepcopying.
Before the fix, we have
```
inputs[0].__dict__
{'attr': FakeTensor(..., size=(3, 2))}
```
the test errors out with
```
======================================================================
ERROR: test_exportdb_supported_case_tensor_setattr (caffe2.test.export.test_serialize.TestDeserialize)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/torch/testing/_internal/common_utils.py", line 529, in instantiated_test
test(self, **param_kwargs)
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/caffe2/test/export/test_serialize.py", line 878, in test_exportdb_supported
self.check_graph(model, case.example_args, _check_meta=_check_meta)
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/caffe2/test/export/test_serialize.py", line 548, in check_graph
_check_graph(pre_dispatch=True)
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/caffe2/test/export/test_serialize.py", line 506, in _check_graph
copy.deepcopy(inputs),
File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 146, in deepcopy
y = copier(x, memo)
File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 211, in _deepcopy_tuple
y = [deepcopy(a, memo) for a in x]
File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 211, in <listcomp>
y = [deepcopy(a, memo) for a in x]
File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 153, in deepcopy
y = copier(memo)
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/torch/_tensor.py", line 206, in __deepcopy__
new_tensor.__dict__ = deepcopy(self.__dict__, memo)
File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 146, in deepcopy
y = copier(x, memo)
File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 231, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 153, in deepcopy
y = copier(memo)
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/torch/_tensor.py", line 108, in __deepcopy__
or (type(self) is not Tensor and self.data_ptr() == 0)
RuntimeError: Cannot access data pointer of Tensor (e.g. FakeTensor, FunctionalTensor). If you're using torch.compile/export/fx, it is likely that we are erroneously tracing into a custom kernel. To fix this, please wrap the custom kernel into an opaque custom op. Please see the following for details: https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html
```
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_exportdb_supported_case_tensor_setattr
```
Differential Revision: D60610860
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132678
Approved by: https://github.com/zhxchen17
Combines contributions from https://github.com/pytorch/pytorch/pull/130505
Some context can be found in this large comment block:
a5b64d39fd/test/dynamo/test_subclasses.py (L1667-L1681)
Changes in this PR
- For each tensor fakified, check the nested int registry in eager, and eagerly symbolicize if that tensor has already been associated with nested int in eager.
- Adds a separate counter stored on FakeTensorMode as a fake analog to _tensor_id_counter (which keeps track of unique tensors). This counter is initialized to the global eager tensor id counter upon creation of the FakeTensorMode, and needs to be reset when the same FakeTensorMode is reused to trace again (in this PR, we piggyback on the epoch incrementing logic).
- (refactor) Today, we store FakeTensor -> symbolic nested int in the global registry. With this PR, symbolic nested int is stored directly on the FakeTensor. (Eager still caches nested int in the registry, though we should avoid this at some point.)
Basically unchanged, but worth noting:
- `__tensor_unflatten__` is still responsible for determining whether we should cache for now. The logic is somewhat simplified.
- to_copy is still using the trick of updating two different tensors in the registry to point to the same nested int. This is kind of broken, but we try to leave it as is, and plan a better fix with the UnionFind stack.
Differential Revision: [D60406772](https://our.internmc.facebook.com/intern/diff/D60406772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130292
Approved by: https://github.com/bdhirsh
ghstack dependencies: #131916, #131803
Rewrite of original PR in https://github.com/pytorch/pytorch/pull/130291
To answer review comments from https://github.com/pytorch/pytorch/pull/130291#pullrequestreview-2166671953:
> At a higher level, do we need this?
Today, this should not change the behavior of anything. But an invariant of "same tensor always corresponds to the same FakeTensor" is nice (from discussion with @bdhirsh).
> Why does this happen?
Today, both dynamo and meta_utils do some recursion when it comes to FakeTensors. So whenever we fakify a subclass, the process would look roughly like:
```
wrap_to_fake (subclass)
meta_utils (subclass)
meta_utils (values) -> not cached because we use callback
meta_utils(offsets) -> not cached because we use callback
wrap_to_fake (values)
wrap_to_fake (offsets) -> cached because we rely on top-level meta_utils
```
However, we know that:
- Caching only occurs at the top-level of meta_utils.
- The return value of the top-level wrap_to_fake is returned.
This means that after all of this:
- The fakified subclass holds inner FakeTensors that are NOT part of the cache
- values/offsets are Fakified a second time, and those instances are cached.
Differential Revision: [D60406773](https://our.internmc.facebook.com/intern/diff/D60406773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131803
Approved by: https://github.com/ezyang
ghstack dependencies: #131916
Instead of having a separate context variable for SymDispatchMode, we
now simply delegate to the current active proxy tensor mode when we
need to trace a SymInt. We maintain a separate `__sym_dispatch__` magic
method as the calling convention is different than `__torch_dispatch__`.
Consolidating the modes in this ways means that we can consistently
disable both of these modes in tandem simply by removing the mode
from the proxy mode infra slot.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132674
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
https://github.com/pytorch/pytorch/pull/130775 recently killed forced specializations for export on complex guards, so the only way we now get a specialized value is if we're able to solve for it. For example, if we have guards `s0 * 2 = s1`, `s0 + 6 = s1`, we specialize `s0 = 6; s1 = 12`.
That might look like this:
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        return x.reshape([-1]) + y

dy = Dim("dy", min=6)
x, y = torch.randn(6, 2), torch.randn(12)
dynamic_shapes = {
    "x": (dy - 6, 2),
    "y": (dy,),
}
```
Our current error message is:
`{symbol} must be specialized to {value} because the guards generated for it are too complex`
This is now misleading, so we change it to:
`solving the guards generated for {symbol} resulted in a specialized value of {value}`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132698
Approved by: https://github.com/avikchaudhuri
I found that when using TorchDynamo (torch.compile) with dynamic shapes on H100, some extra guards are added to check that the sequence length of the inputs to `scaled_dot_product_attention` is divisible by 64. These guards cause unwanted recompilations when the input shape changes.
In fact, these guards are not necessary if our cuDNN version is high enough, so I change the order of those checks to use short-circuit evaluation to skip them and avoid the unnecessary guards.
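A hedged sketch of the short-circuit idea (names and the threshold are illustrative, not the actual guard code):
```python
CUDNN_SUPPORTS_ARBITRARY_SEQ_LEN = True  # assumed capability flag for a new-enough cuDNN

def can_use_fused_attention(seq_len: int) -> bool:
    # With the version-based condition first, a True result means the
    # divisibility check on the dynamic seq_len is never evaluated, so no
    # guard on `seq_len % 64 == 0` gets installed.
    return CUDNN_SUPPORTS_ARBITRARY_SEQ_LEN or seq_len % 64 == 0
```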
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132384
Approved by: https://github.com/eqy, https://github.com/Skylion007
Summary: Previously, when folding BN into conv, we rely on DCE
to clean up the unused BN node from the graph. This works if
the model is already in eval mode, but fails if the model is
still in train mode because DCE doesn't remove nodes with
potential side effects (in this case `_native_batch_norm_legit`).
This required users to move the model to eval mode before calling
convert in order to get a properly DCE'd graph.
To solve this, we manually erase the BN node after folding
instead of relying on DCE. This relaxes the ordering constraints
between `move_exported_model_to_eval` and `convert_pt2e`.
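A simplified sketch of the manual-erase idea described above (illustrative only; the real pass also handles the getitem users of the BN node's tuple output):
```python
import torch.fx as fx

def erase_folded_bn_node(graph: fx.Graph, bn_node: fx.Node, conv_out: fx.Node) -> None:
    # After folding the BN statistics into the conv weights, re-route any remaining
    # users of the BN node to the conv output, then erase the BN node explicitly
    # instead of relying on DCE (which skips nodes with potential side effects).
    bn_node.replace_all_uses_with(conv_out)
    graph.erase_node(bn_node)
    graph.lint()
```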
Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d.test_fold_bn_erases_bn_node
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d.test_fold_bn_erases_bn_node
Reviewers: jerryzh168, yushangdi
Subscribers: jerryzh168, yushangdi, supriyar
Differential Revision: [D60520149](https://our.internmc.facebook.com/intern/diff/D60520149)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131651
Approved by: https://github.com/yushangdi, https://github.com/leslie-fang-intel
`torch.cuda.memory.mem_get_info` allows device strings given the current type hints. However, `device = torch.device('cuda')` leads to `device.index = None`, which results in downstream problems. Setting `optional=True` will insert the default device index in such cases.
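A small repro-style sketch of the case described above:
```python
import torch

dev = torch.device("cuda")
print(dev.index)  # None: a bare 'cuda' device string carries no index

if torch.cuda.is_available():
    # With optional=True on the argument, this resolves to the default device index.
    free_bytes, total_bytes = torch.cuda.mem_get_info(dev)
    print(free_bytes, total_bytes)
```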
Fixes #132583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132616
Approved by: https://github.com/soulitzer
Summary: When the preprocessor check is not satisfied, we leave an unused constexpr around, so when `-Wunused-const-variable` is enabled we get an error. Let's inline these values, since they're not used anywhere else, to avoid this.
Test Plan: CI
Differential Revision: D60723823
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132582
Approved by: https://github.com/houseroad
Preventative fix for a test failure with the oneDNN v3.5 upgrade, where the order of float32 arithmetic may change in torch.addmm (the bias term can be at the start or end of the arithmetic), resulting in slightly different output due to float32 precision loss.
Replaced occurrences of torch.allclose with ~~torch._dynamo.testing.same~~ torch.testing.assert_close, which is the recommended approach per https://github.com/pytorch/pytorch/issues/56544. Its default tolerance is more relaxed than torch.allclose, which lets the test pass with the upcoming oneDNN change.
This should fix aarch64 ci failures in #129932
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130618
Approved by: https://github.com/jgong5, https://github.com/malfet
We provide an API for users to add an ephemeral timeout across all PGs within one rank; the timeout resets when the first collective issued after the timeout was added finishes.
Each extension only covers collectives issued after the API call and before that first collective finishes. The diagram below shows how the timeout changes:
<img width="1174" alt="image" src="https://github.com/user-attachments/assets/354923b7-581c-40de-ae0f-1cd3da273ccc">
While this feature provides flexibility in specific scenarios, it introduces statefulness to timeout setting. Therefore, it is advisable to use this API sparingly and consider alternative approaches, such as directly setting the timeout or utilizing a barrier collective (one can set any timeout to the barrier), whenever feasible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130905
Approved by: https://github.com/ezyang
The current AOTI model runner supports CUDA and CPU. However, it is not easy for an out-of-tree backend to support this feature.
This PR provides a registration mechanism to support this case via two entry points: `RegisterAOTIModelRunner` and `getAOTIModelRunnerRegistry`.
- `RegisterAOTIModelRunner` is used to register a function (`AOTIModelRunnerABC`) that creates an `AOTIModelContainerRunner`. The function signature is as follows.
```C++
using AOTIModelRunnerABC = std::shared_ptr<AOTIModelContainerRunner> (*)(
const std::string& model_so_path,
size_t num_models,
const std::string& device_str,
const std::string& bin_dir);
```
- `getAOTIModelRunnerRegistry` is used to get all the registered backends.
A new backend needs to define its own `AOTIModelContainerRunner` class and then register an `AOTIModelRunnerABC` function with `aoti` to create its `AOTIModelContainerRunner`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131638
Approved by: https://github.com/desertfire, https://github.com/jansel
`return_and_correct_aliasing` is used by FunctionalTensor today to ensure that when we call view/inplace ops, the input and output `FunctionalTensors` share the same storage.
This was previously done with a dispatcher call to `aten.set_`. In this PR I swap it out with a util that just manually does the storage swap. Benefits:
(1) we know this is safe in the specific way it is used by FunctionalTensor: avoiding the extra assertions in `aten.set_` is necessary to avoid some unbacked symint errors
(2) this should improve compile times a bit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132524
Approved by: https://github.com/ezyang
ghstack dependencies: #132243, #132337, #132322
This PR does 3 things:
1. Adds a copy-free strided->jagged layout conversion for NT
2. Adds a copy-free jagged->strided layout conversion for NT
3. Modifies and expands the .to() API to support the layout argument for the specific case of NT layout conversion (a usage sketch follows this list).
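A minimal usage sketch based on the description above (shapes are illustrative; the layout kwarg on `.to()` is the conversion path this PR adds):
```python
import torch

# Strided-layout nested tensor built from variable-length rows.
nt_strided = torch.nested.nested_tensor([torch.randn(2, 5), torch.randn(3, 5)])

# Copy-free conversion to the jagged layout and back via the expanded .to() API.
nt_jagged = nt_strided.to(layout=torch.jagged)
nt_back = nt_jagged.to(layout=torch.strided)
```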
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115749
Approved by: https://github.com/jbschlosser
Summary:
When using activation_memory_budget for float8 training, two issues were noticed:
- When `aggressive_options` (https://fburl.com/code/m1yoskxw) is called , all fp8 gemms (the scaled_mm op) are saved for recomputation.
- After adding "scaled_mm" in the `compute_intensive_ops`, we got the next error from `estimate_runtime`: `mat2 must be col_major` from `meta_scaled_mm`.
To fix it, modified `materialize_arg` to also include the stride of the original tensor.
Test Plan: Run float8 training with `activation_memory_budget`.
Differential Revision: D60777297
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132687
Approved by: https://github.com/Chillee
Fixes code object sharing issue in https://github.com/pytorch/pytorch/issues/132417.
Before this PR, compiled HOPs such as cond and flex_attention were wrapped by _dynamo/external_utils.py:wrap_inline. This caused them to share the same code object. There is a condition surrounding the wrap_inline call, and it currently passes.
We make HOPs fail the check so that they don't share code objects, by adding them to LEGACY_MOD_INLINELIST. Adding them to MOD_INLINELIST doesn't work because trace_rules.check(fn) doesn't check MOD_INLINELIST by default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132427
Approved by: https://github.com/jansel, https://github.com/anijain2305
Summary:
Reland of D60206382.
Suggested in https://github.com/pytorch/pytorch/issues/128394.
If there's an autocast context manager, the predispatch (strict) graph can look something like:
```
class <lambda>(torch.nn.Module):
def forward(self, x: "f32[1]"):
...
_enter_autocast = torch.amp.autocast_mode._enter_autocast('cuda', torch.bfloat16, True, None)
mm: "f32[8, 8]" = torch.ops.aten.mm.default(rand, rand_1); rand = rand_1 = None
_exit_autocast = torch.amp.autocast_mode._exit_autocast(_enter_autocast); _enter_autocast = None
return (mm_1,)
```
But the operator `torch.amp.autocast_mode._enter_autocast` is not a valid ATen op. We remove these nodes by turning autocast into a higher order operator and making a submodule for the blocks between `_enter_autocast` and `_exit_autocast`.
Some potential followup improvement:
1) Merge some of the duplicated logic with `replace_set_grad_with_hop_pass.py`
2) Check the current autocast status (any enabled? dtype?) and not create a submodule if the autocast args match the current autocast status.
Test Plan:
CI
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r "test_predispatch_autocast"
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r "test_predispatch_set_grad"
```
Verified that now we can export the llama model in gh issue 128394 and the gemma model in gh issue 131829 without error.
Differential Revision: D60770038
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132677
Approved by: https://github.com/angelayi
Summary: When HOPs live out of tree, it is impossible to make breaking changes to the HOP API. But HOP implementations are deeply entwined with PyTorch internals. Move the HOP into the PyTorch tree so that changes are possible.
Test Plan: sandcastle and oss ci
Differential Revision: D60674861
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132526
Approved by: https://github.com/SherlockNoMad
Summary: The historical default here is "1", i.e., no parallel compilation. In order to prepare for rolling out the subprocess-based parallel compile, I had previously modified this code to allow parallelism when worker_start_method="subprocess". I realize this probably isn't the best rollout strategy. Rather than opting all internal usages into both a) parallel-compile, _and_ b) a new implementation of parallel compile, let's put the default back to "1" and then start rolling out the new parallel compile implementation only to those usages that have already opted in by explicitly setting compile_thread > 1.
Differential Revision: [D60686105](https://our.internmc.facebook.com/intern/diff/D60686105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132540
Approved by: https://github.com/c00w
Summary:
# Problem
`TORCH_WARN` can cause massive log spam.
I output the logs for before and after adding this change.
*Before:*
* The log file size was ~61.15 MB (61148028 bytes).
*After:*
* The log file size was ~56.44 MB (56444057 bytes).
# Context
Looks like we tried to land this change earlier but it was reverted:
* D59413413
* Reverted https://github.com/pytorch/pytorch/pull/130047 on behalf of https://github.com/clee2000 due to broke test_overrides.py::TestTorchFunctionWarning::test_warn_on_invalid_torch_function
# Testing Update
`test_warn_on_invalid_torch_function` would fail because the warning would not be called on the handling of the second torch function class since `TORCH_WARN_ONCE` stops repeats globally.
Updated so that it runs separate programs. (Was not able to actually run the test; could someone help me with that?)
Test Plan: Need help with this...
Differential Revision: D60561181
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132374
Approved by: https://github.com/ezyang
Summary:
Suggested in https://github.com/pytorch/pytorch/issues/128394.
If there's an autocast context manager, the predispatch (strict) graph can look something like:
```
class <lambda>(torch.nn.Module):
def forward(self, x: "f32[1]"):
...
_enter_autocast = torch.amp.autocast_mode._enter_autocast('cuda', torch.bfloat16, True, None)
mm: "f32[8, 8]" = torch.ops.aten.mm.default(rand, rand_1); rand = rand_1 = None
_exit_autocast = torch.amp.autocast_mode._exit_autocast(_enter_autocast); _enter_autocast = None
return (mm_1,)
```
But the operator `torch.amp.autocast_mode._enter_autocast` is not a valid ATen op. We remove these nodes by turning autocast into a higher order operator and making a submodule for the blocks between `_enter_autocast` and `_exit_autocast`.
Some potential followup improvement:
1) Merge some of the duplicated logic with `replace_set_grad_with_hop_pass.py`
2) Check the current autocast status (any enabled? dtype?) and not create a submodule if the autocast args match the current autocast status.
Test Plan:
CI
```
parsh --build-flags fbcode//mode/dev-nosan fbcode//caffe2/test:test_export
run_tests("test_predispatch_autocast")
```
Reviewed By: angelayi
Differential Revision: D60206382
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131914
Approved by: https://github.com/angelayi
Summary: Fixes T192448049. The module call forms an unusual call stack for the nodes: https://www.internalfb.com/phabricator/paste/view/P1507230978. This is currently not supported by the unflattener and needs some extra design to make it work.
Test Plan: buck2 run 'fbcode//mode/opt' torchrec/distributed/tests:test_pt2 -- --filter-text "test_sharded_quant_fpebc_non_strict_export"
Reviewed By: zhxchen17
Differential Revision: D60528900
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132437
Approved by: https://github.com/Skylion007
Add functional support for torch.addmm with CK backend. See also #125453
# Implementation details
1. It turns out we can use the same template between addmm and matmul; essentially, matmul is addmm with empty bias
2. The Python generator in CK was updated to generate the shared cpp template. The pip package can be installed via `pip install git+https://github.com/rocm/composable_kernel@add-addmm` and will be merged into the `develop` branch after this PR lands, to avoid breaking the current matmul.
# Testing
`pytest test/inductor/test_ck_backend.py -k addmm`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130576
Approved by: https://github.com/chenyang78
Noticed a hang where the stuck thread blocked on a cudaHostUnregister
call, probably due to an internal CUDA deadlock caused by something
else, but it was holding the GIL at the time and blocked other Python
threads.
As far as I can tell, none of the cudart APIs require the GIL to be held,
nor are they marked as thread-unsafe.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132520
Approved by: https://github.com/LucasLLC, https://github.com/kirtiteja
Migrates usages of deprecated APIs in NumPy-2.0 per [numpy-2.0 migration guide](https://numpy.org/devdocs/numpy_2_0_migration_guide.html#numpy-2-0-migration-guide).
I grepped for the old API usages (see list below); they are only referenced in test files under `test/torch_np/numpy_tests/**/*.py`.
Specifically, migrates the usages of the following APIs (a short before/after sketch follows the list):
1. `np.sctypes` → Access dtypes explicitly instead
2. `np.float_` → `np.float64`
3. `np.complex_` → `np.complex128`
4. `np.longcomplex` → `np.clongdouble`
5. `np.unicode_` → `np.str_`
6. `np.product` → `np.prod`
7. `np.cumproduct` → `np.cumprod`
8. `np.alltrue` → `np.all`
9. `np.sometrue` → `np.any`
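A short before/after sketch of a few of these replacements under NumPy 2.0 (illustrative only):
```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])

np.prod(a)       # was np.product(a)
np.cumprod(a)    # was np.cumproduct(a)
np.all(a > 0)    # was np.alltrue(a > 0)
np.any(a > 2)    # was np.sometrue(a > 2)
np.float64       # was np.float_
np.str_          # was np.unicode_
```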
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131909
Approved by: https://github.com/rgommers, https://github.com/Skylion007, https://github.com/atalman
fixes https://github.com/pytorch/pytorch/issues/132016.
Right now, if you run an op for which DTensor has no sharding prop rule, **and** that op accepts non-trivial pytrees of input tensors as arguments, DTensor can end up in an infinite loop before it has the chance to error out due to the missing sharding prop rule.
This PR doesn't fix the problem, but adds rules for the culprit ops (missing foreach ops)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132066
Approved by: https://github.com/wanchaol
Summary:
When a user sets config.profiler_mark_wrapper_call, RECORD_FUNCTION annotations are added to the code. This requires importing the header <ATen/record_function.h>, but the conditional for doing so didn't check
config.profiler_mark_wrapper_call.
Test Plan:
This case is already covered in test_profiler_mark_wrapper_call.
```
(pytorch-3.10) [gabeferns@devvm2252.cco0 ~/pytorch (missing-profile-include)]$ TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k CpuTests.test_profiler_mark_wrapper_call_cpu
stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('fxgraph_cache_miss', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 1 test in 8.080s
OK
```
Fixes https://github.com/pytorch/pytorch/issues/131339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132419
Approved by: https://github.com/jgong5, https://github.com/desertfire
Summary:
Reland #124969 by backing out D60397377 "Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)""
The original diff D54134695 was reverted because of failure of ads nightly cogwheel tests.
The root cause: the logic for generating the mask in the Triton kernel needed an update after a recent refactoring of triton.py. This diff includes the fix for the root cause.
See D54134695 or #124969 for more details.
Test Plan:
Originally failed tests
f585704630
f585733786
Diff patched:
f586664028
f586663820
Differential Revision: D60458597
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132182
Approved by: https://github.com/Yuzhen11
Fixes the Inductor max-autotune mode failures of the below models:
- GPT2ForSequenceClassification
- PegasusForConditionalGeneration
- XGLMForCausalLM
- hf_GPT2
- tnt_s_patch16_224
```log
File "/pytorch/torch/_inductor/index_propagation.py", line 329, in statically_true
evaluated = self.shape_env._maybe_evaluate_static(
File "/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1499, in wrapper
return fn_cache(self, *args, **kwargs)
File "/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4539, in _maybe_evaluate_static
vr = var_ranges[k]
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
KeyError: m_start
```
The `_maybe_evaluate_static` call in `IndexPropagation` may fail. This PR adds a try/except, following the approach in `torch/_inductor/sizevars.py`, by introducing a common utility function.
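A minimal sketch of the guarded-evaluation pattern described above; the helper name is hypothetical and the real utility lives in inductor, but `_maybe_evaluate_static` is the method from the trace:
```python
import sympy

def try_evaluate_static(shape_env, expr: sympy.Expr):
    """Return the statically evaluated expression, or None if evaluation fails
    (e.g. a KeyError for a symbol missing from var_ranges, as in the trace above)."""
    try:
        return shape_env._maybe_evaluate_static(expr)
    except Exception:
        return None
```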
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132128
Approved by: https://github.com/jgong5, https://github.com/jansel
Summary:
feikou observed big numerical gaps when using the math backend on AMD and NV GPUs. This is mainly because we are not using higher-precision FP32 for the intermediate accumulated/materialized parts.
Since the math backend is expected to be slower anyway, and we expect it to generate the correct reference result, it should be worth upcasting FP16/BF16 inputs to FP32, doing FP32/TF32 computations, and then downcasting the FP32 output back to FP16/BF16.
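A minimal sketch of the upcast-compute-downcast idea described above (illustrative only, not the actual math-backend kernel):
```python
import torch

def sdpa_math_upcast(q, k, v):
    # Upcast FP16/BF16 inputs to FP32, accumulate in FP32 (TF32 on the matmuls
    # where enabled), then downcast the result back to the original dtype.
    orig_dtype = q.dtype
    q, k, v = q.float(), k.float(), v.float()
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
    return (attn @ v).to(orig_dtype)
```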
Differential Revision: D58710805
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128922
Approved by: https://github.com/xw285cornell, https://github.com/drisspg
Need to revert due to internal hangs: S437700
This reverts commit b6c1490cc02316ffe85e5ae74651d80f0158ba64.
Revert "[dynamo] implement IteratorVariable and polyfill fallbacks for enumerate (#131725)"
This reverts commit 2576dbbc35d66e8e9ed6cb12216ccc424cb87ec3.
Revert "[dynamo] add itertools repeat/count bytecode reconstruction (#131716)"
This reverts commit 35b4de32fafc5ad024c20ef1275711bffc557ae9.
Revert "[dynamo] add lazy IteratorVariable implementations for map and zip (#131413)"
This reverts commit 7d282d87550787d8269593093519c2ad7c5032cd.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132528
Approved by: https://github.com/ZainRizvi
#### Issue
ScriptObject was previously treated as a normal attribute by the converter. This PR lifts it to be a constant and converts it directly to a GetAttr fx node. ScriptObject also triggers `CallMethod`, and this PR adds that support as well.
#### Test Plan
Add test case for ScriptObject.
`pytest test/export/test_converter.py -s -k test_convert_script_object`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130952
Approved by: https://github.com/angelayi
Before setting up the float8 numeric parity test, I have to set up a regular TP numeric parity test, preferably testing 10 iterations.
This PR sets a baseline for TP numerics; I can verify fp8 on top of it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132543
Approved by: https://github.com/tianyu-l
ghstack dependencies: #132350
Some sympy Functions aren't supported by sympy_interp(); we can't turn them into FX nodes, so currently the runtime asserts CSE pass avoids CSE'ing on any expression containing a sympy Function. https://github.com/pytorch/pytorch/pull/132325 started tracking unsupported functions, so we switch the check to that to be more precise. We also check for and skip unsupported functions when adding asserts - previously we only did the check for CSE, and not when adding new expressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132457
Approved by: https://github.com/avikchaudhuri
Summary:
This is a reland attempt of [#131431](https://github.com/pytorch/pytorch/pull/131431), as, in its original form, the PR has caused issues internally.
We currently don't support some of the `triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent it. This is to unblock internal compilation in some cases. The flag is supplied with the docs mentioning why it is not a good idea to set it.
Test Plan:
```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_autotune_with_unsupported_args
...
----------------------------------------------------------------------
Ran 3 tests in 3.636s
OK
```
Differential Revision: D60701839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132562
Approved by: https://github.com/chenyang78
python_code(verbose=True) (or print_readable()) generates a string with the code representing the fx graph, with extra annotations indicating the size or stride of the tensor. Currently, it only shows sizes/strides for FakeTensors provided in metadata. For subclass tensors like NestedTensor, the outer class (provided in the node metadata) will be a non-FakeTensor and the inner tensors will be fake. This PR expands the conditional to show sizes/strides for all tensors, not just FakeTensors.
Testing: I ran this test script (below), ran it with `TORCH_LOGS=+dynamo` and found in the logs the graph shown below - we see that the input nested tensor has sizes and strides associated with it. Also, I stacked a diff on top of this one that forces the readable graph to be generated whenever PT2 is in use in tests, which should hopefully find any issues; https://github.com/pytorch/pytorch/pull/132195 shows no significant failures except for preexisting failures.
test script:
```python
import torch
def fn(x):
return x.cos()
nt = torch.nested.nested_tensor_from_jagged(
torch.randn(10, 10),
torch.tensor([0, 1, 3, 6, 10]),
)
torch.compile(fn)(nt)
```
logs excerpt:
```
[0/0] [__graph_code] TRACED GRAPH
[0/0] [__graph_code] ===== __compiled_fn_1 =====
[0/0] [__graph_code] /data/users/dberard/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.M
[0/0] [__graph_code] def forward(self, L_x_: "f32[4, zf1, 10][10*zf1, 10, 1]cpu", zf1: "Sym(zf1)"):
[0/0] [__graph_code] l_x_ = L_x_
[0/0] [__graph_code]
[0/0] [__graph_code] # File: /data/users/dberard/scripts/nt_print_graph.py:4 in fn, code: return x.c
[0/0] [__graph_code] cos: "f32[4, zf1, 10][10*zf1, 10, 1]cpu" = l_x_.cos(); l_x_ = None
[0/0] [__graph_code] return (cos,)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132192
Approved by: https://github.com/Chillee
This is already represented in trunk.yml so it seems a bit redundant to include this level of testing in pull.yml.
I've been observing a large spike in our usage of `g3.4xlarge`, which seems to correspond to these builds in particular, so I'm removing them from `pull.yml` since they are already covered in `trunk.yml`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132537
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
Summary:
- moves logging functionality into the `torch/_export/db/logging.py` file.
- adds a check in `_dynamo/eval_frame.py` for optional inputs and errors out with `UnsupportedError`
- changes the case name of `torch_sym_int` to `unsupported_operator`
- checks if the case name is registered in exportdb; if so, we give a link to the case in exportdb.
- TODO: add test
Test Plan:
CI
Running the example in https://pytorch.org/docs/main/generated/exportdb/index.html#optional-input gives the following error logging:
```
E0730 10:53:33.687000 4155538 torch/_dynamo/eval_frame.py:1086] Parameter y is optional with a default value of tensor([[-0.1633, 1.2414, -0.1071],
E0730 10:53:33.687000 4155538 torch/_dynamo/eval_frame.py:1086] [-0.1936, -0.9425, -0.0824]])
E0730 10:53:33.688000 4155538 torch/export/_trace.py:1043] See optional_input in exportdb for unsupported case. https://pytorch.org/docs/main/generated/exportdb/index.html#optional-input
......
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/389acaeb40d57230/tutorials/pytorch/nntest/__torchtest__/torchtest#link-tree/torch/_dynamo/eval_frame.py", line 1091, in produce_matching
raise Unsupported(
torch._dynamo.exc.Unsupported: Tracing through optional input is not supported yet
```
It also logs a `export.error.classified` event in Scuba.
Reviewed By: zhxchen17
Differential Revision: D60427208
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132420
Approved by: https://github.com/zhxchen17
This PR introduces a new sanity check for the public API tests in `.ci/pytorch/test.sh`.
* Validates two public API tests:
1. Ensures `test_correct_module_names` fails when a new file OR an existing file adds an invalid public API function (e.g. one whose `__module__` is unset).
2. Ensures `test_modules_can_be_imported` fails when a module underneath `torch/` cannot be imported.
* Runs this in CI just before the pre-existing FC / BC checks.
I've verified that re-introducing the bug that #131386 fixed causes the new check to fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131390
Approved by: https://github.com/albanD
Summary:
#### Description
Add support for aten::append with a python function that returns a new list with the appended element. We then update the `fx_node` in the `name_to_node` mapping.
aten::append contributed by Jiashen Cao <jiashenc@meta.com>
Fix conversion for csr_ranker_test
```
model_name: csr_ranker_test_4.ptl
has_ts_model: True
has_sample_inputs: True
ops_maybe_missing_meta: set()
script_objects: set()
ts_can_run: True
ts_run_exception: None
can_convert: True
convert_exception: None
ep_result_correct: True
ep_run_exception: None
can_package: True
package_exception: None
sigmoid_can_run: False
sigmoid_run_exception: RuntimeError('not for symbolics')
sigmoid_result_correct: None
```
Test Plan:
test_aten_add_t
test_aten_append_t
test_aten_to_dtype_with_mutating_storage
buck2 run mode/opt sigmoid/inference/ts_migration:main -- --mode test_one --model_name csr_ranker_test
Differential Revision: D60635893
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132529
Approved by: https://github.com/jiashenC
Internally there's a model that's using memory_budget with the partitioner, and using custom triton kernels. The partitioner fails when encountering the triton ops because they don't have `meta["val"]`. This PR adds `meta["val"]` to these fx graph nodes and then adds handling for `meta["val"]` being a dict in the partitioner.
Differential Revision: [D60627813](https://our.internmc.facebook.com/intern/diff/D60627813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132466
Approved by: https://github.com/zou3519
ghstack dependencies: #132356
Inserts send/recv ops where needed in a compute-only pipeline schedule.
Any F or B action will require a recv op for its input and a send op
for its output, except for at the ends of the pipeline.
To avoid hangs caused by mixed-up orderings of sends/recvs across ranks,
we pick one compute action at a time and insert both its send op (on
that rank's schedule), and the matching recv op for the recipient stage
(on the schedule for the rank for that stage).
TODO
Currently ignores a couple of edge cases
- ignores batching (which is an optimization)
- ignores cases where a stage sends to another stage on the same rank,
and should skip the send/recv and directly access memory
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130378
Approved by: https://github.com/H-Huang
ghstack dependencies: #129810
Adds fsdp unshard/reshard ops to a compute-only schedule.
Operates on one pp-rank's schedule at a time, since there is no
cross-pp-rank coordination needed for FSDP. (Unshard/Reshard is across
DP ranks within a PP group).
Uses a heuristic based on examining the next N stages to run compute
operations on this rank, evicting (resharding) and fetching (unsharding)
ahead of time to give unshard operations a chance to overlap with
compute and PP comms.
- this heuristic has not been validated and may not be optimal
Makes the assumption that it's fine to add the UNSHARD/RESHARD actions
to the schedule regardless of if FSDP will actually be used.
- this way, users do not have to tell us at PP schedule creation time if
they plan to use FSDP or DDP
- it is trivial to implement UNSHARD/RESHARD as no-ops inside the
runtime, if FSDP is not detected on the stage module
TODO
- also add FSDP's reduce-scatter? or is it sufficient to leave this
handled by PipelineStage at 'last backward' time
- validate 'next N stages' heuristic and expose an API if needed
- add an e2e test
Co-authored-by: Howard Huang <howardhuang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129810
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
Summary:
as title.
torch._higher_order_ops.auto_functionalize.auto_functionalized is a Python FQN which should NOT be used to talk to the backends; we should use the standard FQN torch.ops.higher_order.auto_functionalized instead.
Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_custom_op_auto_functionalize_pre_dispatch
Differential Revision: D60468759
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132171
Approved by: https://github.com/SherlockNoMad
The functorch partitioners use network flow to split the joint graph into a forward and backward graph. Internally, we've found that upgrading to networkx 2.8.8 (from 2.5) results in some hard-to-debug failures (internal reference: https://fburl.com/workplace/jrqwagdm). And I'm told that there's interest in removing the Python dependency.
So this PR introduces a C++ implementation that mirrors the API provided by networkx. We'll need to add python bindings and do some additional testing to verify correctness.
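A small example of the networkx min-cut API surface being mirrored (this is just the library call, not the joint-graph partitioning itself):
```python
import networkx as nx

G = nx.DiGraph()
G.add_edge("src", "a", capacity=3.0)
G.add_edge("a", "sink", capacity=1.0)
G.add_edge("src", "sink", capacity=2.0)

# minimum_cut returns the cut value and the (reachable, non_reachable) node partition.
cut_value, (reachable, non_reachable) = nx.minimum_cut(G, "src", "sink")
print(cut_value)  # 3.0
```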
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132188
Approved by: https://github.com/Chillee
Summary:
It looks like there are several places in AotCodeCompiler that write files in ways that aren't safe for concurrency. There's a filelock to cope with that, but it seems like the lock path isn't quite robust enough to prevent races. We have an internal stress test failing when executing multiple concurrent versions of the test. It seems as though there's some variability in the content we write to the cpp file, which means we can get a different 'key' across different runs. The lock path includes that key in the lock path name, but the path for the "consts_path" is computed separately. Therefore, I see things like this:
- The computed 'key' is `cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z`
- The lock_path (based on the key) is: `/tmp/torchinductor_slarsen/locks/cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z.lock`
- The cpp path is (also includes the key) is: `/tmp/torchinductor_slarsen/cenzkqfnhu53mrhrdhzjtnblzyma2hgmeo7hai5yqsxzirdavurh/cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z.cpp`
- The consts_path (not based on the key) is: `/tmp/torchinductor_slarsen/cenzkqfnhu53mrhrdhzjtnblzyma2hgmeo7hai5yqsxzirdavurh/cifbshkqkbsurzldsyi2vl5bsnhvejmavys4kktpwrzmpo4ysuoy.bin`
So we have different test instances using different lock paths, but touching the same consts_path and therefore stomping on each other's consts_path. To fix, include the key in the consts_paths.
Test Plan: Ran internal stress test. Repro'd failure and verified this change fixes it.
Differential Revision: D60552021
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132343
Approved by: https://github.com/desertfire
Summary:
We observed that many nodes introduced during split-cat and batch-fusion pattern optimization did not have example-value metadata, which causes problems in our follow-up pattern optimizations, so we add all missing values.
We also fix bugs in some meta updates and a corner-case bug in the old pattern, which caused problems in the follow-up pattern optimization.
We delete the merge_stack_tahn_unbind_pass pattern, which was designed for the cmf model; it can be replaced by the more advanced pattern we added, so we remove it for easier maintenance.
Test Plan:
# unit test
```
buck2 test //caffe2/test/inductor:split_cat_fx_passes
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/15481123762720165
Network: Up: 230KiB Down: 702KiB (reSessionID-756346bf-6da3-4fa0-8d03-1b4fd61e0a7a)
Jobs completed: 30. Time elapsed: 7:23.9s.
Cache hits: 20%. Commands: 5 (cached: 1, remote: 0, local: 4)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0
```
buck2 test @mode/opt pytorch/diff_train_tests/ads/optimus:local_pt2_runner
```
Network: Up: 1.3GiB Down: 84MiB (reSessionID-ff135cdd-e42c-4ab5-8217-907ada465f01)
Jobs completed: 61. Time elapsed: 21:56.5s.
Cache hits: 0%. Commands: 39 (cached: 0, remote: 0, local: 39)
Tests finished: Pass 8. Fail 0. Fatal 0. Skip 0. Build failure 0
# benchmark
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run @mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697
```
Counter({'pattern_matcher_nodes': 752, 'pattern_matcher_count': 732, 'normalization_pass': 328, 'normalization_aten_pass': 12, 'scmerge_cat_removed': 5, 'scmerge_cat_added': 4, 'scmerge_split_removed': 3, 'unbind_stack_pass': 3, 'batch_tanh': 2, 'scmerge_split_sections_removed': 2, 'scmerge_split_added': 2, 'optimize_cat_inputs_pass': 1, 'unbind_cat_to_view_pass': 1, 'fxgraph_cache_miss': 1})
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132297
Approved by: https://github.com/jackiexu1992
Summary:
Fixes T197371132.
Previously, we called copy.deepcopy to avoid mutating the original signature. However, this causes errors when the signature references a FakeScriptObject, which in turn references a real torch.ScriptObject, failing with "The tensor has a non-zero number of elements, but its data is not allocated yet."
We therefore just change it to a shallow copy. This should be good enough for guarding the signature.
Test Plan: buck2 run 'fbcode//mode/opt' torchrec/distributed/tests:test_pt2 -- --filter-text "test_sharded_quant_ebc_non_strict_export"
Differential Revision: D60476839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132181
Approved by: https://github.com/BoyuanFeng
Define the `TORCH_ONNX_USE_EXPERIMENTAL_LOGIC` flag to allow for enabling the new torch.onnx logic and hiding them during migration and testing. The actual logic migration will happen after.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132299
Approved by: https://github.com/titaiwangms
Enable exception chaining of BackendCompilerFailed exception in call_user_compiler. This prevents the original exception and traceback, which is often the most useful for debugging, from being discarded.
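A minimal sketch of the `raise ... from` chaining this change enables (illustrative; `compile_with_backend` is a hypothetical wrapper, not the exact `call_user_compiler` body):
```python
from torch._dynamo.exc import BackendCompilerFailed

def compile_with_backend(gm, example_inputs, compiler_fn):
    try:
        return compiler_fn(gm, example_inputs)
    except Exception as e:
        # "raise ... from e" keeps the inner exception as __cause__, so its traceback
        # is printed under "The above exception was the direct cause of ..." below.
        raise BackendCompilerFailed(compiler_fn, e) from e
```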
Example output without the patch
> Traceback (most recent call last):
> [Traceback from test_slice_scatter_issue122291 to raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(]
> [Trace back from call_user_compiler to _inplace_generalized_scatter raise RuntimeError]
> torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
> RuntimeError: shape error in scatter op, can not broadcast torch.Size([16, 2]) to torch.Size([16, 6])
> Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
Example output with the patch
> Traceback (most recent call last):
> [Traceback from_inplace_generalized_scatter to raise error_type(message_evaluated)]
> RuntimeError: expand: attempting to expand a dimension of length 2!
> The above exception was the direct cause of the following exception:
> Traceback (most recent call last):
> [Traceback from call_user_compiler to _inplace_generalized_scatter raise RuntimeError]
> RuntimeError: shape error in scatter op, can not broadcast torch.Size([16, 2]) to torch.Size([16, 6])
> The above exception was the direct cause of the following exception:
> Traceback (most recent call last):
> [Traceback from test_slice_scatter_issue122291 to raise BackendCompilerFailed(self.compiler_fn, e) with e]
> RuntimeError: shape error in scatter op, can not broadcast torch.Size([16, 2]) to torch.Size([16, 6])
> Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131186
Approved by: https://github.com/jansel
This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistently performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402).
This is what the results look like:
Explanation of columns:
**wrong_max_spdup**: In the worst case, how much better would the best choice have been
**wrong_gman_spdup**: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean)
**max_spdup_default**: Highest speedup achieved by the learned heuristic over the default choice
**gman_spdup_default**: Geomean speedup achieved by the learned heuristic over the default choice
**max_slowdown_default**: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case
**non_default_preds**: Number of times the learned heuristic predicted a choice that is not the default choice
**default_better**: Number of times the default choice is better than the choice made by the heuristic
```
set crit max_depth min_samples_leaf correct wrong unsure total wrong_max_spdup wrong_gman_spdup max_spdup_default gman_spdup_default max_slowdown_default non_default_preds default_better
train entropy 5 0.01 2376 740 323 3439 1.855386 1.063236 11.352318 3.438279 1.022164 3116 2
test entropy 5 0.01 563 183 71 817 1.622222 1.060897 10.084181 3.507741 1.017039 746 2
```
While the number of wrong predictions is high, on average the best choice is only around 6% better. What is important is that the choice predicted by the learned heuristic performs better than the default choice.
I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul.
|batch size|prompt length| fallback | heuristic | speedup |
|----------|-------------|------------:|------------:|--------:|
| 1 | 7 | 75.31 tok/s | 148.83 tok/s| 1.97 |
| 1 | 11 | 75.99 tok/s | 148.15 tok/s| 1.94 |
| 4 | 7 | 103.48 tok/s | 472.00 tok/s| 4.56 |
| 4 | 11 | 103.56 tok/s | 371.36 tok/s| 3.58 |
| 8 | 7 | 201.92 tok/s | 813.44 tok/s| 4.02 |
| 8 | 11 | 201.76 tok/s | 699.36 tok/s| 3.46 |
Currently, the heuristic only applies to the following inputs:
- m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful to not choose a config that performs worse than the fallback)
- k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely bad. In one case one config, that usually performs very well, was 130x slower.)
- mat1 not transposed
- mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613
Approved by: https://github.com/eellison
Summary: This makes it so that stress tests on separate processes on the same machine don't clobber the directories of each other. InductorTestCase will automatically make a fresh tmpdir for each unit test.
Test Plan:
```
buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_aot_autograd_cache.py::AOTAutogradCacheTests::test_nn_module_with_params_global_constant' --run-disabled --stress-runs 10 --record-results
```
Now passes
Differential Revision: D60604811
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132432
Approved by: https://github.com/masnesral
Fixes #130087
This patch provides a built-in id() implementation for TensorVariable when id() is called on tensors such as module parameters. Calling id() on intermediate tensors is not supported.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130100
Approved by: https://github.com/anijain2305
https://github.com/pytorch/pytorch/pull/130422 caused the test `test.inductor.test_aot_inductor.AOTInductorTestABICompatibleCuda.test_fp8_abi_compatible_cuda` to fail (unclear why it was not run in GitHub) with `torch/csrc/inductor/aoti_torch/c/shim.h:390:34: note: candidate function not viable: requires 9 arguments, but 6 were provided`. We suspect that the kernel produced by the lowering function, which is no longer a fallback choice, has a schema issue at codegen. Fp8 is not used through AOTI currently and it is difficult to revert the PR (BE week), so we'll skip the test temporarily while making the new lowering compatible with AOTI.
Testing: the failed test on internal diff is now skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132453
Approved by: https://github.com/henrylhtsang
Summary: Currently suggested fixes pick a map from symbols to user variables. However it is possible that many user variables point to the same symbol, and some may be preferred over others. Thus we dump this info as well.
Test Plan: updated test
Sample error with new format:
```
Could not guard on data-dependent expression u2 >= 0 (unhinted: u2 >= 0). (Size-like symbols: none)
<snip>
The following call raised this error:
File "test/export/test_export.py", line 1950, in forward
return r.view(items[0], items[2])
To fix the error, insert one of the following checks before this call:
1. torch._check(items[2] >= 0)
2. torch._check(items[2] < 0)
(These suggested fixes were derived by replacing `u2` with items[2] in u2 >= 0 and its negation.)
```
Differential Revision: D60574478
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132393
Approved by: https://github.com/BoyuanFeng
Context:
We are planning to make a BC breaking change to `torch.load` by flipping the default for `weights_only` from `False` --> `True` in a future release. With `weights_only=True`, a custom unpickler is used that limits what can be loaded to state_dicts containing tensors (there is also a way for the user to allowlist specific things to be loaded). The goal of this is to attempt to prevent remote execution of arbitrary code when using `torch.load`.
To my understanding, in export, `torch.load` is used internally to load arbitrary objects, so we should set `weights_only=False` here to prevent the flip from breaking export.
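A minimal sketch of the behavioral difference described above (paths are illustrative):
```python
import torch

torch.save({"w": torch.randn(2, 2)}, "weights.pt")

# Fine under the future default: only tensors / state_dicts are unpickled.
state = torch.load("weights.pt", weights_only=True)

# Loaders that need arbitrary Python objects (as export's internal loads do)
# must opt out explicitly to keep working after the default flips.
state_again = torch.load("weights.pt", weights_only=False)
```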
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132348
Approved by: https://github.com/angelayi
Summary:
Skip the warning if the fake script object doesn't implement a fake method for:
1. __obj_flatten__: for real script object only.
2. __set_state__ and __get_state__ for serialization. Don't expect it to be used during tracing.
Test Plan: Existing tests.
Reviewed By: angelayi
Differential Revision: D60478460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132306
Approved by: https://github.com/angelayi
mvlgamma backward trips DEBUG=1 asserts when trying to construct an empty tensor with `layout=torch.jagged`. This happens due to passing `self.options()` to `arange()` in `mvlgamma_backward()`. The fix in this PR unconditionally constructs the `arange()` with the strided layout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132422
Approved by: https://github.com/albanD
# Motivation
This PR intends to support ABI=0 build for XPU backend.
# Additional Context
The major change is adding the compilation option `-D__INTEL_PREVIEW_BREAKING_CHANGES` for the host compiler (gcc) and `-fpreview-breaking-changes` for the XPU device kernel compiler (icpx). Why?
Because we use
- gcc to compile host code and link the SYCL runtime, so we need to pass `-D__INTEL_PREVIEW_BREAKING_CHANGES` to tell the host compiler to invoke the ABI-neutral API included in SYCL; and
- icpx to compile device kernel code and link the SYCL runtime, so we need to pass `-fpreview-breaking-changes` to tell the device kernel compiler to build ABI-neutral code. Besides,
- `libsycl-preview.so` is an ABI-neutral library but `libsycl.so` is not.
This PR depends on https://github.com/pytorch/pytorch/pull/131643.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130110
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
Summary:
AOTAutogradCache currently only checks the local directory instead of both local and remote when saving/loading from the cache, so if remote cache is turned on, it will cache miss.
Disable remote caching for now on these tests: when I work on remote caching compatibility, I'll re-enable them here.
Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_aot_autograd_cache.py::AOTAutogradCacheTests::test_nn_module_with_params_global_constant' --run-disabled
passes
Differential Revision: D60588615
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132409
Approved by: https://github.com/masnesral
Summary:
Occasionally we run into a partition that looks like this for Add:
```
SourcePartition(nodes=[_constant2, add_2], source=<built-in function add>, input_nodes=[x], output_nodes=[_constant2, add_2], params=[_constant2])
```
In this case we are adding a constant to an input, and reusing the constant later down the line. This causes our constant to be an output in our SourcePartition. The assumption then that:
```
add_node = add_partition.output_nodes[0]
```
will not necessarily hold. As a result, we must check that the output node is indeed a call function and not a constant.
Test Plan: buck test mode/dev-nosan //executorch/backends/xnnpack/test:test_xnnpack_ops -- test_qs8_add_constant
Differential Revision: D60413221
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132092
Approved by: https://github.com/jerryzh168
Python's set has non-deterministic iteration order. We recently ran into an internal failure that did not reproduce consistently.
See repro here: P1453035092.
Now, with these changes, it fails consistently. In follow-ups we could also consider adding a lint rule for uses of either set() or set literals.
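A minimal illustration of the determinism concern (a dict used as an ordered-set stand-in):
```python
items = ["relu", "add", "mul", "cat"]

s = set(items)            # iteration order depends on string hashing, which is
print(list(s))            # randomized per process, so it can differ across runs

d = dict.fromkeys(items)  # dicts preserve insertion order deterministically
print(list(d))            # always ['relu', 'add', 'mul', 'cat']
```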
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
Summary:
In D60024830 I attempted to define these overloads, but gated the implementation on the wrong macros. Namely I used `__CUDACC__` instead of `__HIPCC__` (facepalm).
It might be worth merging this with the nvidia case via typedefs (e.g. `typedef __hip_bfloat16 __gpu_bfloat16` and `typedef __nv_bfloat16 __gpu_bfloat16`), but that seems like an entirely new paradigm for torch, so I'll punt that change to the future so we can focus on supporting `BFloat16(__hip_bfloat16)` here
Test Plan: CI
Differential Revision: D60362079
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132234
Approved by: https://github.com/houseroad
In pdb, it's pretty common to print `FSDPParamGroup` and `FSDPParam`; this makes sure they are human readable.
print `FSDPParam` in pdb
```
FSDPParam(fqn=layers.6._checkpoint_wrapped_module.attention.wq.weight, orig_size=torch.Size([128, 256]))
```
print `FSDPParamGroup` in pdb
```
FSDPParamGroup(fqn=layers.6)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132350
Approved by: https://github.com/awgu
Summary:
A bunch of issues around support for sympy functions like `TruncToInt` and `ToFloat` are uncovered by https://github.com/pytorch/pytorch/issues/131897. This PR addresses only one of them (as the title suggests). Another issue is deserialization, filed as a task: T197567691.
However, the most important issue is that adding runtime assertions is broken right now: specifically, sympy_interp with `PythonReferenceAnalysis` currently doesn't work because the implementations of some of these sympy functions in `PythonReferenceAnalysis` (or falling through to its base class) do not expect proxies. This means things like `math.trunc`, `math.floor`, `round`, etc. don't work, and this can be easily repro'd by using them inside `torch._check`. According to ezyang, these implementations need to point to new torch functions that can expect proxies (see how minimum and maximum are implemented, e.g.).
Test Plan: added test (original repro provided)
Differential Revision: D60540951
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132325
Approved by: https://github.com/ezyang
Fixes#132196
Let's say we have:
- op(x, y) that mutates both x and y
- new_x, new_y = functional_op(x, y) is the functional variant
If we are presented with functional_op(x, x), we must not reinplace
this into op(x, x), because then it would be writing to the same Tensor.
Instead, it's OK to reinplace one of them and to clone the other:
```
>>> y = x.clone()
>>> op(x, y)
```
This also applies if we have views: functional_op(x, x[0])
should not reinplace into op(x, x[0]).
The fix is to avoid reinplacing an arg if a view of it already has been
reinplaced.
Test Plan:
- new and existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132238
Approved by: https://github.com/oulgen, https://github.com/eellison
This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistently performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402).
This is what the results look like:
Explanation of columns:
**wrong_max_spdup**: In the worst case, how much better would the best choice have been
**wrong_gman_spdup**: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean)
**max_spdup_default**: Highest speedup achieved by the learned heuristic over the default choice
**gman_spdup_default**: Geomean speedup achieved by the learned heuristic over the default choice
**max_slowdown_default**: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case
**non_default_preds**: Number of times the learned heuristic predicted a choice that is not the default choice
**default_better**: Number of times the default choice is better than the choice made by the heuristic
```
set crit max_depth min_samples_leaf correct wrong unsure total wrong_max_spdup wrong_gman_spdup max_spdup_default gman_spdup_default max_slowdown_default non_default_preds default_better
train entropy 5 0.01 2376 740 323 3439 1.855386 1.063236 11.352318 3.438279 1.022164 3116 2
test entropy 5 0.01 563 183 71 817 1.622222 1.060897 10.084181 3.507741 1.017039 746 2
```
While the number of wrong predictions is high, on average the best choice is only around 6% better. What is important is that the choice predicted by the learned heuristic performs better than the default choice.
I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul.
|batch size|prompt length| fallback | heuristic | speedup |
|----------|-------------|------------:|------------:|--------:|
| 1 | 7 | 75.31 tok/s | 148.83 tok/s| 1.97 |
| 1 | 11 | 75.99 tok/s | 148.15 tok/s| 1.94 |
| 4 | 7 | 103.48 tok/s | 472.00 tok/s| 4.56 |
| 4 | 11 | 103.56 tok/s | 371.36 tok/s| 3.58 |
| 8 | 7 | 201.92 tok/s | 813.44 tok/s| 4.02 |
| 8 | 11 | 201.76 tok/s | 699.36 tok/s| 3.46 |
Currently, the heuristic only applies to the following inputs:
- m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful to not choose a config that performs worse than the fallback)
- k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely bad. In one case one config, that usually performs very well, was 130x slower.)
- mat1 not transposed
- mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613
Approved by: https://github.com/eellison
ghstack dependencies: #131610, #131611
This fixes a few instances where we assumed indexing expressions were
non-negative. This is not valid when we have more complicated
expressions involving masking e.g. pointwise cat.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131761
Approved by: https://github.com/ezyang
Summary: ET sets the length limit of the string input variable to 8192 characters. However, the node process_group::init has more than 8192 characters for an Ads 128-rank job. This DIFF is to temporarily remove this limit, so ET can capture the complete information of the process group.
Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTrace
Reviewed By: sanrise
Differential Revision: D60341306
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132169
Approved by: https://github.com/sraikund16, https://github.com/sanrise
This PR introduces a script that can be used to collect data for mixed_mm to learn a heuristic with AutoHeuristic. This PR also includes the following things:
- Move pad_mm-related AutoHeuristic files into a subdirectory.
- Introduce an interface benchmark_runner.py that can be subclassed to introduce new scripts to run benchmarks in order to collect data with AutoHeuristic (see gen_data_pad_mm.py and gen_data_mixed_mm.py).
The idea behind the interface is that, in the end, it hopefully makes it easier to collect data for new optimizations, and thus makes it easier to learn a heuristic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131611
Approved by: https://github.com/eellison
ghstack dependencies: #131610
Summary:
Implement a callback-based dynamic counter with pluggable backends.
The backend API and integration is similar to WaitCounter. Note that this counter should only be used with C++ callbacks, since making it safe to be used for GIL-requiring callbacks would be pretty challenging and may defeat the whole purpose of this counter (since the duration of the callback can no longer be guaranteed).
Test Plan: unit test
Differential Revision: D60464055
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132166
Approved by: https://github.com/asiab4
This PR mostly refactors by putting code into utils files so that they can be shared between codecache.py and compile_fx.py. Afterwards, it then changes compile_fx so that:
- When saving to FXGraphCache, we save onto the CompiledFXGraph all the necessary metadata for running post compile steps (realigning inputs, cudagraphification).
- When loading from FXGraphCache, we use the saved information directly, instead of calculating them from scratch.
What this does is make it so that `FXGraphCache.load()` is a perfect cache on compile_fx_inner, in that it **returns exactly what compile_fx_inner returns**. This also makes it possible for AOTAutogradCache, given a key to the fx graph cache and example inputs, to get back the full return value of compile_fx_inner.
## What's a post compile step?
We define a **post-compile** to be the set of actions that need to run after FXGraphCache either loads from the cache or misses and runs compilation. These steps include:
- Setting the tracing context's output strides
- Running cudagraphs if enabled
- Maybe realign inputs if cudagraphs didn't run
To run these steps, we save all the necessary metadata in CompiledFxGraph, and use them on a cache hit to reconstruct the object.
## Splitting cudagraphs work into pre/post compile
Cudagraphs does a lot of work on the input graph module to determine if cudagraphs can be enabled. This is the code that involves cudagraph_tests and stack traces. This will work in a world where we have access to the input graph module, but with AOTAutograd warm start, we won't have access to that information anymore. Therefore we can split cudagraphs work into two parts: on a cache miss (and therefore a full compile), we do the cudagraphs testing work, and save cudagraph_fail_reasons into the cache. Then on a cache hit, we know whether or not we can run cudagraphs, and if we can't, we can emit the correct error messages.
Implementation notes:
- We save `fx_kwargs` directly onto the CompiledFXGraph. `fx_kwargs` is already, by definition, part of the cache key, so this is safe to do when it comes to cache correctness.
- ^ Why do we do above even though FXGraphCache.load takes fx_kwargs as an argument? Because AOTAutogradCache **doesn't** have access to fx_kwargs: they're annoyingly encoded in the functools.partial() of the fw_compiler, so *only* inductor knows about these options. They're fully captured by the AOTAutogradCache key (since every key to fx_kwargs is either a global config, or a field that's deterministic based on an input graph module), but their values are still needed to run cudagraphs/postprocessing. Therefore, it's easier/safer to store it on the cached result.
- Willing to hear other approaches here if we think saving these extra fields is not reasonable, though I can't think of another way to do this that's less complicated to explain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130572
Approved by: https://github.com/eellison
**Background:** NJT utilizes a `jagged_unary_pointwise()` fallback that historically has assumed blindly that the first arg is an NJT. This assumption breaks certain ops; for example `pow(scalar, Tensor)` has an NJT as the second arg.
This PR expands `jagged_unary_pointwise()` and the associated schema validation logic to handle an NJT in args other than the first position.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131937
Approved by: https://github.com/soulitzer
ghstack dependencies: #131898, #131704
# Motivation
This PR enhances the codegen to allow generating code for the XPU backend.
XPU operators currently need to be registered in a hand-written way. Developers have no chance to take advantage of shared code to handle tensor meta setting (like strides, proxy output, structured kernels). Manually porting code is error-prone and may lead to high maintenance effort.
We utilize the backend_whitelist argument in `gen.py` to generate the XPU-needed headers and source code.
# Usage
XPU ops live in `third_party/torch-xpu-ops`; the codegen process is triggered before the compilation of `torch-xpu-ops`.
We use the following commands to generate XPU operators
` python -m torchgen.gen --source-path path/to/yaml/of/xpu --install-dir build/xpu --per-operator-headers --static-dispatch-backend --backend-whitelist=XPU`
The difference lies in `backend-whitelist=XPU`. The backend-whitelist key is an existing argument in torchgen.
The input of `gen.py` is code templates and operator yaml. We share the same templates as `aten`. A simplified yaml lies in `third_party/torch-xpu-ops`, which only includes the supported XPU operators. This yaml is a copy-and-modify of `native_functions.yaml`. No extra entry is added, and the format is the same as the one in `aten`.
# Result
All operators headers are generated in `build/xpu/ATen/ops` independently, which would not affect operators declared/defined by CPU/CUDA or any other backend. XPU operators only include headers in this folder.
# Verification
* In `third-party/torch-xpu-ops`, we migrate all supported kernels to structured kernels style, where they are registered through `REGISTER_XPU_DISPATCH` or `TORCH_IMPL_FUNC`, and we have UT verification based on `test_ops.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130082
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/atalman
ghstack dependencies: #130019
As observed while working on this fix (https://github.com/pytorch/pytorch/pull/130994), 128 threads per block seems quite low. This PR increases the default to improve performance, and also slightly refactors the code to replace the hard-coded 128 for better maintainability.
By increasing the default max threads per block from 128 to 256, I saw the "CUDA total" time of `aten::index_select` drop from 44.820ms to 33.608ms when profiling the embedding script below:
```
import torch
from torch import profiler  # import assumed; not shown in the original snippet

input = torch.randint(low=0, high=16032, size=[131072], device="cuda")
w = torch.randn([16032, 16384], device="cuda")
with profiler.profile(record_shapes=True) as prof:
    x = torch.nn.functional.embedding(input, w)
```
I tested with the default from 128 to 256, 512, 1024 on several different types of devices, and observed "CUDA total" time dropping even more and more latency improvement as the number increases. Below is one example of latency improvement ratio:
| max threads per block | latency improvement |
|----------------------:|--------------------:|
| 128 | 1x |
| 256 | 1.33x |
| 512 | 1.44x |
| 1024 | 1.49x |
Using 512 as the new default max for non-mi300x to be conservative, which is 1.44x faster than using 128 with the above profiling script.
Using 1024 for mi300x is 1.61x faster than using 128 with the same profiling script, and using 512 is 1.57x faster.
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131713
Approved by: https://github.com/jeffdaily, https://github.com/syed-ahmed, https://github.com/malfet
Python's set has non-deterministic iteration order. There is an internal failure which we recently ran into that did not fail consistently.
See repro here: P1453035092.
Now, with these changes, it fails consistently. In follow-ups we could also consider adding a lint rule for uses of either set() or set literals.
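As a small illustration of why this matters, iteration order over a set of strings can change between interpreter runs; this snippet is illustrative only and unrelated to the internal repro:
```python
import os
import subprocess
import sys

# Run the same snippet under different hash seeds; the printed order can differ,
# which is exactly the kind of nondeterminism that iterating over a set introduces.
code = "print(list({'fwd', 'bwd', 'joint'}))"
for seed in ("1", "2"):
    env = {**os.environ, "PYTHONHASHSEED": seed}
    out = subprocess.run([sys.executable, "-c", code], env=env, capture_output=True, text=True)
    print(seed, out.stdout.strip())
```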
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
Summary:
Basic pybind integration for WaitCounter providing a guard API.
Also fixes broken copy/move constructor in WaitGuard (it wasn't really used with the macro-based C++ API).
Test Plan: unit test
Reviewed By: asiab4
Differential Revision: D60463979
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132167
Approved by: https://github.com/asiab4
get_plain_tensors() should result in a DFS of leaves.
The error was that plain tensors (leaves) on the same level were returned before the plain tensors of subclasses, even if the subclasses come earlier in the "flatten" list.
Original issue from AO: https://github.com/pytorch/ao/issues/515
Test: TBD, need to make an asymmetric subclass with dense tensors and subclasses
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132096
Approved by: https://github.com/bdhirsh
Modify the existing `layer normalization` operator in PyTorch, invoked by `torch.layer_norm`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the `aten` padding operator, enables PyTorch users to invoke `torch.nn.functional.layer_norm` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` or `(B, *, M, N)` nested tensor.
Write unit tests based on the `softmax` jagged operator to verify the accuracy of the ragged reduction implementation for `torch.nn.functional.layer_norm`. Add unit tests to verify error handling for unsupported features.
Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. The layer normalization operator also requires an operation on a 2-dimensional layer; for nested tensors with 4 or more dimensions, I flatten the extra dimensions, then unflatten them after performing layer normalization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132172
Approved by: https://github.com/davidberard98
ghstack dependencies: #132170
Modify the existing `softmax` operator in PyTorch, invoked by `torch.softmax`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the aten padding operator, enables PyTorch users to invoke `torch.softmax` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor.
Write unit tests based on the `sum` and `mean` jagged operators to verify the accuracy of the ragged reduction implementation for `torch.softmax`. Add unit tests to verify error handling for unsupported features in `NestedTensor` `torch.softmax`.
Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. In addition, the `softmax` operator is required to take in as input an integer for the reduction dimension `dim`, requiring new unit tests heavily inspired by the `sum` and `mean` jagged operator unit tests. `Softmax` also allows for reducing along the batch dimension.
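A minimal usage sketch of the newly supported reduction; shapes and values are illustrative:
```python
import torch

# (B, *, M) jagged nested tensor; dim=1 is the ragged dimension.
nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)
out = torch.softmax(nt, dim=1)  # softmax along the jagged dimension
```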
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132170
Approved by: https://github.com/davidberard98
Add similar semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new Buffer class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same as the register_buffer method has not been changed. The persistent parameter in the Buffer type is to indicate whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new Buffer type recognized by inductor and dynamo. Remaining changes are test changes to make sure that the Buffer type can be used as a drop in replacement for register_buffer as it just leads to register_buffer being called. The addition of this new functionality still allows for normal tensors to be used as buffers so these changes are intended to be backwards compatible.
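A minimal usage sketch of the new Buffer type described above (assuming it is exposed as `torch.nn.Buffer`):
```python
import torch
import torch.nn as nn

class Stats(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # Assigning a Buffer to a module attribute ends up calling register_buffer under the hood.
        self.running_mean = nn.Buffer(torch.zeros(3))               # persistent by default
        self.scratch = nn.Buffer(torch.zeros(3), persistent=False)  # excluded from state_dict

m = Stats()
print(list(m.state_dict().keys()))              # ['running_mean']
print(sorted(n for n, _ in m.named_buffers()))  # ['running_mean', 'scratch']
```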
Fixes #35735
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125971
Approved by: https://github.com/albanD, https://github.com/anijain2305, https://github.com/mlazos
Original issue: https://github.com/pytorch/pytorch/issues/114338
Reland of: https://github.com/pytorch/pytorch/pull/128016
Summary from previous PR:
We assume only two possible mutually exclusive scenarios:
1. Running the compiled region for training (any of the inputs has requires_grad): produced differentiable outputs should have requires_grad.
2. Running the compiled region for inference (none of the inputs has requires_grad): all outputs do not have requires_grad.
Even if the user runs the region under no_grad() but has an input Tensor with requires_grad, we go with training scenario (1).
With current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only that any of inputs requires_grad
2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under `torch.enable_grad()`
Changes in partitioner?
Inference and training graphs differed in their return container (list vs. tuple).
The changes in the partitioner unify this so that a tuple is always returned.
As a result - some changes in test_aotdispatch.py for graph contents list -> tuple.
Why was it reverted?
There was an inference regression for the hf_Reformer model.
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```
Because one of the compiled graphs contained outputs that are aliases of inputs which are nn.Parameter(requires_grad=True).
Even though the torchbench inference benchmarks run inside `torch.no_grad()`, alias ops (specifically `expand` for hf_Reformer) preserve requires_grad.
As a result we started compiling a training graph instead of an inference graph.
Fix for view ops:
If we have outputs that are aliases of inputs that require grad, those outputs' requires_grad is not a reason to generate a training graph.
This is handled in aot_autograd.py, where output_and_mutation_safe are calculated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh
**Summary**
I created functions that reduced repeating code in the console and json APIs which also improved their readability for future developers.
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132070
Approved by: https://github.com/XilunWu
`register_sharding` is an experimental API that allows users to register sharding strategies for an operator when the tensor inputs and outputs are :class:`DTensor`s. It can be useful when: (1) there doesn't exist a default sharding strategy for ``op``, e.g. when `op` is a custom operator that is not supported by `DTensor`; (2) when users would like to overwrite default sharding strategies of existing operators.
Here's an example:
```
@register_sharding(aten._softmax.default)
def custom_softmax_sharding(x, dim, half_to_float):
    softmax_dim = dim if dim >= 0 else dim + x.ndim
    acceptable_shardings = []

    all_replicate = ([Replicate()], [Replicate(), None, None])
    acceptable_shardings.append(all_replicate)

    for sharding_dim in range(x.ndim):
        if sharding_dim != softmax_dim:
            all_sharded = (
                [Shard(sharding_dim)],
                [Shard(sharding_dim), None, None],
            )
            acceptable_shardings.append(all_sharded)

    return acceptable_shardings
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131108
Approved by: https://github.com/wanchaol
**Summary**
If a `global buffer` has been replaced by a `local buffer`, we add this `global buffer` into `removed_buffers` to avoid unnecessary allocation. However, a special case is when this `global buffer` can reuse a previous buffer. We didn't handle this case previously, which causes a functional failure in f151f25c0b/torch/_inductor/codegen/wrapper.py (L440)
In this PR, we resolve the issue by not adding this global buffer into `V.kernel.inplace_update_buffers` when the buffer has been marked as `removed`.
**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_local_buffer_with_line_reuse
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132018
Approved by: https://github.com/jgong5, https://github.com/peterbell10
Add the Inductor lowering for `torch._scaled_mm`, whose API was last updated in https://github.com/pytorch/pytorch/pull/128683.
The lowering does:
- for tensor-wise scaling, auto-tune between the default ATen kernel (cuBLAS) and Triton kernel configurations.
- for row-wise scaling, auto-tune between the default ATen kernel (CUTLASS kernel added in https://github.com/pytorch/pytorch/pull/125204) and Triton kernel configurations.
The Triton kernel template is based on 3ad9031d02 (D56337896) by @choutim, without using SPLIT_K, and on the mm template in `torch/_inductor/kernel/mm.py`.
## Testing:
- Logging shows max-autotune tuning (`AUTOTUNE scaled_mm`) for both tensor-wise and row-wise scaling when called with the two scaling types.
- Row-wise scaling allows operator fusion between preceding pointwise/reduction op and amax/cast:
- output code Evaluating m=256, n=256, k=256, fusion_case='pointwise', scaling_mode='row'
- P1477224245 - 2 kernels
- output code Evaluating m=2048, n=256, k=2048, fusion_case='reduction', scaling_mode='row'
- P1477227340 - 2 kernels
- UT `python test/inductor/test_fp8.py -- TestFP8Lowering`
## Benchmarking
Eager/compiled tensor-wise/row-wise scaling for various shapes:
https://docs.google.com/spreadsheets/d/1VfWEVuyrwoWysfbS0_u2VHJ-PsdWkF1qIsiD60AzTes/edit?gid=2113587669#gid=2113587669
- Some of the “compiled” cases are slightly slower than “eager”. It’s because max-autotune selected the ATen kernel in the compiled case, and I think the discrepancy is variance.
Eager/compiled tensor-wise/row-wise scaling with pointwise/reduction preceding op for various shapes:
https://docs.google.com/spreadsheets/d/1Nv07NrdffQIoDeMjo9E0V-E-EYrEN0WysO_bn1bc6ns/edit?gid=1715488446#gid=1715488446
## Questions for reviewers:
- Should the type of the accumulator `ACC_TYPE` always be in float32? If not, where is this type set (output layout?)?
## Todo:
- Make the Triton template use the improved persistent kernel version (https://github.com/pytorch/FBGEMM/pull/2735 by @htyu)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130422
Approved by: https://github.com/ipiszy
This PR utilizes the info from the existing OpInfo database `op_db` to contribute to general NJT testing.
* New tests in `TestNestedTensorOpInfo`
* `test_forward()` - compares forward output to an unbind-based reference
* `test_backward()` - compares forward output and grads to an unbind-based reference
* `test_forward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) to eager
* `test_backward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) and grads to eager
* To avoid adding a bunch of NJT-specific stuff to the `OpInfo` structure, this PR translates `op_db` -> a NJT-specific `njt_op_db`.
* `UnaryUfuncInfo`s utilize a new `sample_inputs_unary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc.
* `BinaryUfuncInfo`s utilize a new `sample_inputs_binary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc.
* `ReductionOpInfo`s utilize a new `sample_inputs_njt_reduction()` which covers full reductions, reductions over the jagged dim, and reductions over the non-jagged dim
* Several xfails were added to get things passing
TODO (future PRs):
* Pass non-contiguous / non-contiguous with holes NJTs (maybe we should have separate tests for these? most ops don't support NJTs with holes today)
* Mixed (NT, T), (T, NT) inputs for binary ops
* Handle other types of OpInfos (beyond unary pointwise, binary pointwise, and reduction) by manually by writing sample_inputs_funcs
* Address all xfails via fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131704
Approved by: https://github.com/soulitzer
ghstack dependencies: #131898
Summary:
there are some issues with dim order creation. T194410923 has a detailed illustration.
One of the reasons is that the `is_contiguous` function may sometimes generate an ambiguous memory format result (some tensors might be both channels_last and contiguous at the same time), and dim order generation relies on the memory format result underneath as a shortcut.
To mitigate the issue, we make dim order creation use the shortcut if and only if the tensor belongs to a single memory format. Otherwise, we still recalculate it.
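One concrete example of the ambiguity described above: a contiguous NCHW tensor with H == W == 1 satisfies both memory-format checks at the same time.
```python
import torch

t = torch.randn(2, 3, 1, 1)
print(t.is_contiguous())                                   # True
print(t.is_contiguous(memory_format=torch.channels_last))  # True
```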
Test Plan: CI
Differential Revision: D60056793
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131366
Approved by: https://github.com/ezyang
Try to unblock https://github.com/pytorch/pytorch/issues/131991
- `nn.init.orthogonal_` uses `tensor.new`, which is the legacy factory function. We change this to `tensor.new_empty` (empty is okay since it will be immediately followed by `.normal_()` to fill the tensor) so that it preserves `DTensor`-ness (see the sketch after this list).
- `nn.init.orthogonal_` uses QR decomposition (`aten.linalg_qr.default`) and `torch.diag` (calling into `aten.diagonal_copy.default`). For simplicity, we use naive replicate strategies for now. `aten.diagonal_copy.default` could do something more sophisticated for sharded inputs, but I would rather defer that to later due to the complexity. For `orthogonal_` support specifically, since the result of the QR decomp will be replicated, the input to `aten.diagonal_copy.default` will be replicated.
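A minimal sketch of the first change above; the helper name is hypothetical and only the factory-call swap is from the PR:
```python
import torch

def _orthogonal_flatten(tensor: torch.Tensor) -> torch.Tensor:
    # new_empty() preserves the subclass (e.g. DTensor-ness) of `tensor`, unlike the
    # legacy tensor.new() factory; the values are immediately overwritten by normal_().
    rows = tensor.size(0)
    cols = tensor.numel() // rows
    return tensor.new_empty((rows, cols)).normal_()  # was: tensor.new(rows, cols).normal_()

print(_orthogonal_flatten(torch.empty(4, 6)).shape)  # torch.Size([4, 6])
```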
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132104
Approved by: https://github.com/albanD, https://github.com/wanchaol
This was causing some terrible error messages, e.g.:
```
# printing directly: cudaError.???
# casting to int first: 712
Traceback (most recent call last):
File "/data/users/lpasqualin/fbsource/fbcode/scripts/lpasqualin/playground.py", line 15, in <module>
main()
File "/data/users/lpasqualin/fbsource/fbcode/scripts/lpasqualin/playground.py", line 11, in main
_create_cpu_state_dict(sd, share_memory=True, pin_memory=True)
File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 436, in _create_cpu_state_dict
ret = _iterate_state_dict(
^^^^^^^^^^^^^^^^^^^^
File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 143, in _iterate_state_dict
ret = {
^
File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 144, in <dictcomp>
key: _iterate_state_dict(
^^^^^^^^^^^^^^^^^^^^
File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 125, in _iterate_state_dict
ret = tensor_func(iter_object, pg, device, companion_obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 428, in tensor_func
succ == 0
AssertionError: Pinning shared memory failed with error-code: cudaError.???
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132089
Approved by: https://github.com/Skylion007
Summary: Currently, running explain with TORCH_LOGS enabled will cause duplicate logging because explain uses the exact same code path for conversion. This PR disables logging when running explain, and moves all logging to convert() to prevent logging from __init__ when we are just using explain.
Test Plan: Manual testing with attached outputs.
Reviewed By: SherlockNoMad, angelayi
Differential Revision: D60199007
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132082
Approved by: https://github.com/ydwu4
I didn't test this path when creating the orchestrator. This PR fixes
that path to work in the capture_triton path. The problem is that we are
handling a value that is an int (in the capture_triton path) and a
ConstantVariable (in the Dynamo triton path) so we abstract that out in
the orchestrator.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132143
Approved by: https://github.com/oulgen
**Background**: `TestCase.assertEqual()` is commonly used during test case validation. Historically, to support NSTs, the logic was written to compare two nested tensors by unbinding them and comparing their components. This logic applied to NJTs as well, which in practice meant that two NJTs with different nested ints in their shapes could compare equal if their components were equal.
This PR changes the above logic so that NJTs are no longer unbound during comparison, allowing them to receive full shape validation. This makes `TestCase.assertEqual()` stricter for NJTs, requiring them to have the same nested ints in their shapes to compare equal.
Note that some tests rely on the old, looser behavior. To address this, the PR introduces a base `NestedTensorTestCase` that defines a helper function `assertEqualIgnoringNestedInts()` so that these tests can explicitly opt in to the looser comparison behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131898
Approved by: https://github.com/soulitzer
Summary:
The previous logic adds skipped files when the file is imported, which happens at a very early stage. However, we could set skip_torchrec at a later stage (e.g., in APS, we set it during trainer execution). In that case, the skip logic would still take effect since the skipped files had already been added.
So in this diff, we revise the logic so that it can adapt to changes of skip_torchrec at later stages.
Test Plan:
Tested on APS models:
buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher_live -- mode=local_ig_fm_uhm_mini model_name=ig_fm_one_sparse_benchmark features=ig_fm_one_sparse_benchmark model=ig_fm_one_sparse_benchmark training.pipeline_type=pt2
commit: 2fb485d9e
torchrec related paths were not skipped.
Differential Revision: D59779153
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130783
Approved by: https://github.com/yanboliang
There are some substantive changes. Instead of recording the *next* instruction in the speculation log, I record the *current* instruction. I think this is more intuitive, we always call speculation at the beginning of executing an instruction, so logically, the entry is associated with the current instruction. (Note that self.instruction_pointer is next instruction, as conventionally we increment IP before calling speculate).
The cosmetic change is to also pass in the Instruction corresponding to the IP and print it, and beef up the error message, including notes about the previous instruction that was run before it failed (this is typically the critical instruction).
At time of submission, this test case triggered the error:
```
diff --git a/test/distributed/test_dynamo_distributed.py b/test/distributed/test_dynamo_distributed.py
index 5ade17856e1..60ef89be346 100644
--- a/test/distributed/test_dynamo_distributed.py
+++ b/test/distributed/test_dynamo_distributed.py
@@ -844,6 +844,39 @@ class TestMultiProc(DynamoDistributedMultiProcTestCase):
for r in res[1:]:
self.assertEqual(res[0], r)
+ @unittest.skipIf(not has_triton(), "Inductor+gpu needs triton and recent GPU arch")
+ @config.patch(enable_compiler_collectives=True)
+ def test_compiler_collectives_automatic_dynamic_speculation_divergence(self):
+ with _dynamo_dist_per_rank_init(self.rank, self.world_size):
+ torch._dynamo.utils.clear_compilation_metrics()
+
+ # TODO: This should be possible to do inside the function, but
+ device = f"cuda:{self.rank}"
+
+ @torch.compile()
+ def f(x, y):
+ zx = x.shape
+ zy = y.shape
+ return x.sum() + y.sum()
+
+ if self.rank == 0:
+ dataloader = [4, 4]
+ else:
+ dataloader = [3, 4]
+
+ for data in dataloader:
+ f(
+ torch.randn(data, device=self.rank),
+ torch.randn(data, device=self.rank),
+ )
+
+ metrics = torch._dynamo.utils.get_compilation_metrics()
+ # Number of compiles same on all nodes
+ res = [None] * self.world_size
+ torch.distributed.all_gather_object(res, len(metrics))
+ for r in res[1:]:
+ self.assertEqual(res[0], r)
+
@requires_nccl()
```
although I plan to fix this soon.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131982
Approved by: https://github.com/anijain2305, https://github.com/mlazos, https://github.com/jansel
This PR fixes a bug in `test_correct_module_names` introduced in #130497. It also addresses post-fix test failures in:
* `torch/ao/quantization/__init__.py` - set the correct `__module__` for several public API helpers
* `torch/library.py` - add `register_vmap` to `__all__`
* `torch/nn/attention/flex_attention.py` - make `round_up_to_multiple` private by prepending an underscore
* `torch/storage.py` - introduce `__all__` to avoid `Self` being re-exported as a public API
* `torch/distributed/pipelining/schedules.py` - add `ZeroBubbleAlgorithm` to `__all__`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131386
Approved by: https://github.com/albanD
In `_create_chunk_sharded_tensor`, `_get_remote_device_str` is used. By default it uses the node count to determine the device instance. For HPU, we need to use the current device to get the device instance.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132120
Approved by: https://github.com/awgu
Summary:
There are two kinds of exceptions:
Case #1:
```
static input data pointer changed.
input name: primals_2. data pointer changed from 140315748992000 to 140315748993536. input stack trace: File "/dev/shm/uid-30083/c0899c70-seed-nspid4026535598_cgpid16622182-ns-4026535192/caffe2/test/inductor/test_cudagraph_trees.py", line 1826, in forward
return self.static_tensor + x + self.goo(x)
File "/dev/shm/uid-30083/c0899c70-seed-nspid4026535598_cgpid16622182-ns-4026535192/caffe2/test/inductor/test_cudagraph_trees.py", line 1816, in forward
return self.linear(x)
input name: primals_3. data pointer changed from 140315748990976 to 140315748993024. input stack trace: File "/dev/shm/uid-30083/c0899c70-seed-nspid4026535598_cgpid16622182-ns-4026535192/caffe2/test/inductor/test_cudagraph_trees.py", line 1825, in forward
self.static_tensor.add_(torch.ones((2, 2), device="cuda"))
```
Case #2:
```
static input data pointer changed.
input name: primals_2. data pointer changed from 139852509086720 to 139852509088256. input stack trace: None
input name: primals_3. data pointer changed from 139852509085696 to 139852509087744. input stack trace: File "/dev/shm/uid-30083/f61ee184-seed-nspid4026560782_cgpid769179-ns-4026560865/caffe2/test/inductor/test_cudagraph_trees.py", line 1825, in forward
self.static_tensor.add_(torch.ones((2, 2), device="cuda"))
```
The current implementation only covered case #2.
Test Plan: https://www.internalfb.com/intern/testinfra/testrun/15481123762274476
Differential Revision: D60340212
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132043
Approved by: https://github.com/BoyuanFeng
**Summary**
Previously, we used `data_type_propagation` at the start of `codegen` to deduce the data type of each node and save this information in `node.meta[OptimizationContext.key]`. Then, we used this node metadata to update the cppcsevar data type in `update_on_args`. However, this method is not always correct. For example, in the codegen of `indirect_indexing` (see [here](096dc444ce/torch/_inductor/codegen/common.py (L1844))), we insert nodes on the fly and reuse the node of `indirect_indexing` to set the `cppcsevar` data type. In this PR, we plan to enhance the `cppcsevar` data type deduction:
- We will deduce the `cppcsevar` data type in `update_on_args` by reusing the code in `data_type_propagation`.
- To align the data type of scalar and vector variables, we previously always cast the scalar to the vector's data type. This caused a data type misalignment between `codegen` and `data_type_propagation`. We should use the same data type promotion logic to align the data types of scalar and vector variables.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130827
Approved by: https://github.com/jgong5, https://github.com/jansel
Summary: This code was overly complex and was confusing some guards: basically, if a cached result tensor isn't a view, there's no reason to be messing with its storage.
Test Plan: unit tests pass
Differential Revision: D60387821
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132050
Approved by: https://github.com/oulgen
These OSS changes are part of a larger MTIA diff. The OSS part is a simple refactor that makes it easier to query max block sizes by the prefix of the grid dimension, e.g. `"X"`, as opposed to having to use separate functions for `get_xmax()`, `get_ymax()`, etc.
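A hypothetical illustration of the query-by-prefix shape of this refactor; the names and values below are assumptions, not the actual Inductor/MTIA code:
```python
# Example limits only; real values depend on the backend.
MAX_BLOCK = {"X": 2048, "Y": 1024, "Z": 1024}

def get_max_block(prefix: str) -> int:
    # One lookup keyed by the grid-dimension prefix replaces separate
    # get_xmax()/get_ymax()/get_zmax() helpers.
    return MAX_BLOCK[prefix]

print(get_max_block("X"))
```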
Differential Revision: D60195669
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131730
Approved by: https://github.com/eellison
Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](17b45e905a/torch/fx/graph_module.py (L824))), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module.
Example print from `python test/export/test_unflatten.py -k test_unflatten_nested`
```
class UnflattenedModule(torch.nn.Module):
def forward(self, x: "f32[2, 3]"):
# No stacktrace found for following nodes
rootparam: "f32[2, 3]" = self.rootparam
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam
mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam); x = rootparam = None
# No stacktrace found for following nodes
foo: "f32[2, 3]" = self.foo(mul); mul = None
bar: "f32[2, 3]" = self.bar(foo); foo = None
return (bar,)
class foo(torch.nn.Module):
def forward(self, mul: "f32[2, 3]"):
# No stacktrace found for following nodes
child1param: "f32[2, 3]" = self.child1param
nested: "f32[2, 3]" = self.nested(mul); mul = None
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param
add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param); nested = child1param = None
return add
class nested(torch.nn.Module):
def forward(self, mul: "f32[2, 3]"):
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x
div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul); mul = None
return div
class bar(torch.nn.Module):
def forward(self, add: "f32[2, 3]"):
# No stacktrace found for following nodes
child2buffer: "f32[2, 3]" = self.child2buffer
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer
sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer); add = child2buffer = None
return sub
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128617
Approved by: https://github.com/zhxchen17, https://github.com/pianpwk
```
# Mode to emulate pytorch eager numerics for lower precision (fp16, bf16)
# Pytorch eager computes bf16/fp16 by upcasting inputs to fp32 and downcasting after
# For multiple, fused pointwise nodes, inductor will elide the intermediary upcasts and downcasts
# Typically this should be closer to fp64 ref numerics. However, it can be useful for debugging
# to emulate the eager numerics.
```
We add extra upcasts and downcasts for pointwise nodes that correspond to casts that existed in the original user program (excluding pointwise nodes that are emitted during decomposition). Since this is mostly for debugging, I added this information in the `meta` so that this mode does not have unintended side effects like changing pattern matching.
In theory there could also be some other casts with fused reduction -> reduction, although I haven't seen this in practice as much; it could be done as a follow-up. Note: this only works with the CUDA backend right now.
This mode was sufficient to eliminate compile differences from https://fb.workplace.com/groups/385893200869952/posts/464263173032954/?comment_id=465199259606012&reply_comment_id=465676792891592.
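To make the rounding difference concrete, a small standalone illustration (this is not Inductor code, just the two numerics conventions spelled out by hand):
```python
import torch

a, b, c = (torch.randn(4, dtype=torch.bfloat16) for _ in range(3))

# Eager-style: every pointwise op upcasts to fp32 and downcasts its result to bf16.
eager_like = ((a.float() + b.float()).to(torch.bfloat16).float() * c.float()).to(torch.bfloat16)

# Fused: the intermediate downcast/upcast pair is elided, keeping fp32 precision throughout.
fused_like = ((a.float() + b.float()) * c.float()).to(torch.bfloat16)

print((eager_like - fused_like).abs().max())  # may be nonzero due to the extra rounding step
```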
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131595
Approved by: https://github.com/shunting314, https://github.com/bdhirsh, https://github.com/jansel
Modify the existing `layer normalization` operator in PyTorch, invoked by `torch.layer_norm`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the `aten` padding operator, enables PyTorch users to invoke `torch.nn.functional.layer_norm` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` or `(B, *, M, N)` nested tensor.
Write unit tests based on the `softmax` jagged operator to verify the accuracy of the ragged reduction implementation for `torch.nn.functional.layer_norm`. Add unit tests to verify error handling for unsupported features.
Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. The layer normalization operator also requires an operation on a 2-dimensional layer; for nested tensors with 4 or more dimensions, I flatten the extra dimensions, then unflatten them after performing layer normalization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131519
Approved by: https://github.com/davidberard98
ghstack dependencies: #131518
Add a new label `ci-test-showlocals` and add it to test config filter.
If the PR is labeled with `ci-test-showlocals` or "ci-test-showlocals"
present in the PR comment, the test config filter will set a environment
variable `TEST_SHOWLOCALS`. Then `pytest` will show local variables on
failures for better debugging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981
Approved by: https://github.com/malfet
ghstack dependencies: #131151
------
As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.
Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.
Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361
```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000
@classmethod
def eval(cls, base, divisor):
# python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
# Assert triggered by inequality solver
# assert base.is_integer, base
# assert divisor.is_integer, divisor
# We don't provide the same error message as in Python because SymPy
# makes it difficult to check the types.
if divisor.is_zero:
raise ZeroDivisionError("division by zero")
if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
int_oo,
-int_oo,
sympy.oo,
-sympy.oo,
):
return sympy.nan
if base is sympy.nan or divisor is sympy.nan:
return sympy.nan
if base.is_zero:
return sympy.S.Zero
if base.is_integer and divisor == 1:
return base
if base.is_integer and divisor == -1:
return sympy.Mul(base, -1)
if (
isinstance(base, sympy.Number)
and isinstance(divisor, sympy.Number)
and (
base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
)
):
r = float(base) / float(divisor)
if r == math.inf:
return int_oo
elif r == -math.inf:
return -int_oo
elif math.isnan(r):
return sympy.nan
else:
return sympy.Integer(math.floor(r))
if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
return sympy.Integer(int(base) // int(divisor))
if isinstance(base, FloorDiv):
return FloorDiv(base.args[0], base.args[1] * divisor)
# Expands (x + y) // b into x // b + y // b.
# This only works if floor is an identity, i.e. x / b is an integer.
for term in sympy.Add.make_args(base):
quotient = term / divisor
if quotient.is_integer and isinstance(divisor, sympy.Integer):
# NB: this is correct even if the divisor is not an integer, but it
# creates rational expressions that cause problems with dynamic
# shapes.
return FloorDiv(base - term, divisor) + quotient
try:
gcd = sympy.gcd(base, divisor)
if gcd != 1:
> return FloorDiv(
sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
)
base = -1.00000000000000
cls = FloorDiv
divisor = -1.00000000000000
gcd = 1.00000000000000
quotient = 1.00000000000000
term = -1.00000000000000
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}
@wraps(func)
def wrapper(*args, **kwargs):
try:
> retval = cfunc(*args, **kwargs)
E RecursionError: maximum recursion depth exceeded in comparison
E
E To execute this test, run the following from the base repo dir:
E python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
args = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func = <function Function.__new__ at 0x7fc530317280>
kwargs = {}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR got reverted a couple of times because we saw post-land test failures that we didn't see before merge. This PR only resets dynamo before each test in `test_ops_gradients.py` to make it easier to land.
Eventually, after we reset dynamo in each individual test file, we can move the change to the base class (TestCase) and remove the change in individual test files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131397
Approved by: https://github.com/zou3519
ghstack dependencies: #131551, #131388, #131372
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR got reverted a couple of times because we saw post-land test failures that we didn't see before merge. This PR only resets dynamo before each test in `test_module.py` to make it easier to land.
Eventually, after we reset dynamo in each individual test file, we can move the change to the base class (TestCase) and remove the change in individual test files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131372
Approved by: https://github.com/zou3519
ghstack dependencies: #131551, #131388
Made the following changes:
- mutates_args is now keyword-only and mandatory. This is to align with
torch.library.custom_op (which makes it mandatory because it's easy to
miss)
- op_name is now keyword-only. This helps the readability of the API
- updated all usages of infer_schema
This change is not BC-breaking because we introduced
torch.library.infer_schema a couple of days ago.
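A minimal usage sketch of the updated signature (the printed schema string is indicative):
```python
import torch
from torch.library import infer_schema

def my_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x + y

# mutates_args is keyword-only and mandatory; op_name is keyword-only and optional.
print(infer_schema(my_add, mutates_args=()))  # e.g. "(Tensor x, Tensor y) -> Tensor"
```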
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130705
Approved by: https://github.com/yushangdi
ghstack dependencies: #131777
On Windows, _triton.py creates a confusing error ("RuntimeError: Should never be installed") since triton is not supported on Windows. This is not caught by the current PyTorch exception handling. This pull request adds new exception handling for that runtime error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132006
Approved by: https://github.com/oulgen
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR got reverted a couple of times because we saw post-land test failures that we didn't see before merge. This PR only resets dynamo before each test in `test_torch.py` to make it easier to land.
Eventually, after we reset dynamo in each individual test file, we can move the change to the base class (TestCase) and remove the change in individual test files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131388
Approved by: https://github.com/zou3519
ghstack dependencies: #131551
Fix the compilation error:
```cpp
/tmp/tmpywg34bca/tg/ctg7wbli6pvydsjr2xsxamdbamkquhlincuky3dzopa3ilrxqdwt.cpp:401:24: error: cannot convert ‘at::Tensor’ to ‘const bfloat16*’ {aka ‘const c10::BFloat16*’}
401 | cpp_fused_div_mm_0(arg2_1, constant2, _frozen_param1, buf1);
| ^~~~~~
| |
| at::Tensor
```
The generated code after the fix will be:
```cpp
cpp_fused_div_mm_0((bfloat16*)(arg2_1.data_ptr()), (bfloat16*)(constant2.data_ptr()), (bfloat16*)(_frozen_param1.data_ptr()), (bfloat16*)(buf1.data_ptr()));
```
Multiple changes are required for ABI compatible mode. Separate it into a follow-up PR in this ghstack: https://github.com/pytorch/pytorch/pull/131841
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129557
Approved by: https://github.com/leslie-fang-intel
# Motivation
Structured codegen makes it easier to decouple tensor meta setting from the kernel implementation. At present, XPU operators need to handle tensor metas in a hand-written way.
We plan to leverage the codegen system to auto-generate structured operators. This PR adds `DispatchStub` support for Intel GPUs. Based on that, XPU operators will be able to register kernel functors to operator stubs.
This is a prerequisite of PR #130082, where we will modify the codegen system to generate the XPU-needed source files and headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130019
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
Previously, using `_MaskPartial` with multiple embeddings had the following issues:
1. Suppose an `nn.Embedding` has shape `[vocab_size, emb_size]`. When there is more than one embedding sharing the same `vocab_size` but with different `emb_size`s, they would not share `OpStrategy`, since each, when involved in computation, would have a different `OpSchema`; however, there would be a cache hit for redistribute (specifically `_gen_transform_infos` in `torch/distributed/_tensor/_redistribute.py` when doing `Replicate` -> `_MaskPartial`) because `_MaskPartial` only has `vocab_size` as `logical_dim_size` and does not carry `emb_size` as an attribute. This cache hit is undesirable and would cause trouble when doing all-reduce/reduce-scatter on the new `_MaskPartial` in a separate `OpStrategy`. The error was reported in #130725. In this PR, we introduce `offset_shape` to represent the embedding's full shape to avoid cache hits from embeddings of different shapes.
2. The second issue is when we have two `nn.Embedding`s `emb1` and `emb2` with the same shape. There will be cache hit not only in `_gen_transform_infos`, but also in `OpStrategy` generation. Previously, if we sequentially do `Replicate` -> `_MaskPartial` for both `emb1` `emb2` and then sequentially do reduction on the `_MaskPartial` of `emb1`, it would destroy the `MaskBuffer` and `emb2` would hit error. This PR adds a `refcount` for the `MaskBuffer` so that it can be properly shared by multiple `nn.Embedding`s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131264
Approved by: https://github.com/wanchaol
We're currently under-counting mutations from ExternKernel since they use `NoneLayout` which doesn't have an associated shape and dtype. Instead, we can get that information from the buffer being mutated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131645
Approved by: https://github.com/jansel
Summary:
The `test_model_modified_weights` in `test_aot_inductor.py` has been failing internally for a while. The behavior leading to the test failure was that, after updating the eager model's weights and recompiling the (CPU) model with AOTI, the output of the model was identical to the one before the weights were updated.
The root cause is here in Python:
8927fc209f/test/inductor/test_aot_inductor_utils.py (L69-L71)
which, in turn, instantiates the `Runner` object in C++ relying on `dlopen` for loading the *.so. The problem is that repeated `dlopen` call does not reload the library from the same path, unless `dlclose` is called in-between the two `dlopen` calls. There is `dlclose` in the `Runner`'s destructor, but it's not called, likely due to the way the loaded `runner` gets closed over in Python:
8927fc209f/test/inductor/test_aot_inductor_utils.py (L83-L94)
Here we add copying the *.so file to a unique temporary path right before loading it into a `runner` to avoid the `dlopen` staleness described above. This fixes the `test_model_modified_weights` and, hopefully, will help avoiding similar errors in the future tests.
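A minimal sketch of the same workaround using a plain `ctypes` loader rather than the AOTI runner (the helper name is illustrative):
```python
import ctypes
import os
import shutil
import tempfile

def load_fresh(so_path: str) -> ctypes.CDLL:
    # Copying the shared object to a unique path guarantees that dlopen loads a fresh
    # copy rather than returning the cached handle for a previously loaded path.
    tmp_dir = tempfile.mkdtemp()
    fresh_path = os.path.join(tmp_dir, os.path.basename(so_path))
    shutil.copy(so_path, fresh_path)
    return ctypes.CDLL(fresh_path)
```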
Test Plan: Tested internally.
Differential Revision: D60348165
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131994
Approved by: https://github.com/chenyang78
When a collective can be hidden through either simple overlapping or micro-pipeline TP, we prefer simple overlapping to avoid the overhead associated with decomposition. If `reorder_for_compute_comm_overlap` is enabled, we identify collectives that can be hidden through simple overlapping and exclude them from micro-pipeline TP candidates.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131410
Approved by: https://github.com/weifengpy
This PR enables the Inductor compute/comm reordering passes to Traceable FSDP2 to achieve overlap. Note that the overlap is not maximally optimized yet and the follow-up work will be done in subsequent PRs.
Test commands:
- `pytest -rA test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131614
Approved by: https://github.com/yifuwang
ghstack dependencies: #131510
This PR creates these `GroupedSchedulerNode`s:
- One for each all-gather code block (cast + copy-in + all-gather)
- One for each all-gather-wait code block (all-gather-wait + copy-out)
- One for each reduce-scatter code block (copy-in + reduce-scatter)
- One for each reduce-scatter-wait code block (reduce-scatter-wait)
This serves two goals:
- Prevent outside ops from being fused into these op groups, in order to have more predicable memory usage.
- Make it easier to specify the dependency e.g. from `i+1` all-gather group node to the `i` all-gather-wait group node, to enforce FSDP2 comm ordering (i.e. "serialization of comms").
The actual "reorder-for-FSDP-compute-comm-overlap" PR will come next.
Test commands:
- `pytest -rA test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131510
Approved by: https://github.com/yifuwang
Modify the existing `softmax` operator in PyTorch, invoked by `torch.softmax`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the aten padding operator, enables PyTorch users to invoke `torch.softmax` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor.
Write unit tests based on the `sum` and `mean` jagged operators to verify the accuracy of the ragged reduction implementation for `torch.softmax`. Add unit tests to verify error handling for unsupported features in `NestedTensor` `torch.softmax`.
Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. In addition, the `softmax` operator is required to take in as input an integer for the reduction dimension `dim`, requiring new unit tests heavily inspired by the `sum` and `mean` jagged operator unit tests. `Softmax` also allows for reducing along the batch dimension.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131518
Approved by: https://github.com/davidberard98
## Motivation
This refactor aligns our testing methodology with the Flash Attention upstream repository while addressing several key issues:
1. **Standardized comparison**: We now compare fused kernels against float64 references, using the maximum of a calculated tolerance (based on same-precision math implementation) or standard float32 `atol`.
2. **Reduced redundancy**: Utilizing the same tensors for both same-precision math and fused kernel runs eliminates duplication.
3. **Improved maintainability**: The new approach simplifies tolerance adjustments across all affected tests.
4. **Consistency**: Standardizing tensor comparisons ensures a more uniform and reliable testing suite.
These changes collectively simplify our testing code, improve its maintainability, and provide a more robust framework for validating our attention mechanisms.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131743
Approved by: https://github.com/jainapurva, https://github.com/jbschlosser
Changes:
1. Switch `AotCodeCompiler` to new cpp_builder.
2. Only use `deprecated_cpp_compile_command` for `fb_code`, since I can no longer debug it without Meta-internal environment access.
3. Add `TODO` comments so that a Meta employee can help continue this work.
4. Because of item 3, only the `deprecated_cpp_compile_command` path for `fb_code` remains to be fixed, so remove `validate_new_cpp_commands`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
This synchronizes lf-canary-scale-config and lf-scale-config with the ones in test-infra.
This really needs some automatic validation to prevent it from drifting out of sync over and over again (coming soon...)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131955
Approved by: https://github.com/malfet
This PR fixes a bug in `test_correct_module_names` introduced in #130497. It also addresses post-fix test failures in:
* `torch/ao/quantization/__init__.py` - set the correct `__module__` for several public API helpers
* `torch/library.py` - add `register_vmap` to `__all__`
* `torch/nn/attention/flex_attention.py` - make `round_up_to_multiple` private by prepending an underscore
* `torch/storage.py` - introduce `__all__` to avoid `Self` being re-exported as a public API
* `torch/distributed/pipelining/schedules.py` - add `ZeroBubbleAlgorithm` to `__all__`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131386
Approved by: https://github.com/albanD
Summary: CPU CI nodes failed to find a valid VecISA because importing torch under the default pytorch directory fails with the following message, so switch cwd to a tmp directory.
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/var/lib/jenkins/workspace/torch/__init__.py", line 66, in <module>
from torch.torch_version import __version__ as __version__
File "/var/lib/jenkins/workspace/torch/torch_version.py", line 4, in <module>
from torch.version import __version__ as internal_version
ModuleNotFoundError: No module named 'torch.version'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131812
Approved by: https://github.com/eellison, https://github.com/malfet
Closes #129507
This makes two changes to the sort kernel:
1. Use int16 for the indices since we only operate on small dims anyway
2. Instead of passing an explicit mask, we pass the rnumel and imply the
mask from that which saves an additional reduction in the sort
kernel's inner loop.
In my benchmarks, this gives enough of a perf improvement to bump up the
max rblock to 512.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131719
Approved by: https://github.com/eellison
We automatically generate FakeTensor support for Triton kernels (the FakeTensor kernel for a triton kernel is "return None"). The same thing should apply to the meta kernel.
Tests:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131896
Approved by: https://github.com/oulgen
Previously, FlopCounterMode would ignore any custom ops registered
through `register_flop_formula`. The problem was:
- register_flop_formula(target) requires target to be an OpOverloadPacket.
- register_flop_formula used register_decomposition to populate its registry
- register_decomposition decomposes the OpOverloadPacket into OpOverload before
putting it into the registry
- FlopCounterMode ignores OpOverloads in its registry (it assumes the
registry is a dictionary mapping OpOverloadPacket to flop formula).
register_decomposition is too heavy of a hammer, plus this isn't a
decomposition, so I changed the registration mechanism.
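As a hedged sketch of the intended usage, the example below registers a flop formula for a hypothetical custom op's `OpOverloadPacket`; the op `mylib::my_mm` and its formula are made up for illustration:
```python
import torch
from torch.utils.flop_counter import FlopCounterMode, register_flop_formula

# Hypothetical custom op used only for this sketch.
@torch.library.custom_op("mylib::my_mm", mutates_args=())
def my_mm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return a @ b

# register_flop_formula expects the OpOverloadPacket, not an OpOverload.
@register_flop_formula(torch.ops.mylib.my_mm)
def my_mm_flops(a_shape, b_shape, *args, out_shape=None, **kwargs):
    m, k = a_shape
    _, n = b_shape
    return 2 * m * n * k

with FlopCounterMode(display=False) as counter:
    my_mm(torch.randn(4, 8), torch.randn(8, 16))
print(counter.get_total_flops())  # 1024 with the shapes above
```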
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131777
Approved by: https://github.com/Chillee
Implemented by extending `collections.abc.MutableSet` and backing it with a dictionary, which is ordered. From collections.abc.MutableSet:
```
A mutable set is a finite, iterable container.
This class provides concrete generic implementations of all
methods except for __contains__, __iter__, __len__,
add(), and discard().
```
In addition to implementing those methods I also had to define some methods of python's set which were not implemented in MutableSet.
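A minimal sketch (not the actual torch implementation) of the dict-backed approach, covering only the abstract methods listed above:
```python
from collections.abc import MutableSet


class OrderedSet(MutableSet):
    """Insertion-ordered set backed by a dict (dicts preserve order in 3.7+)."""

    def __init__(self, iterable=()):
        self._dict = dict.fromkeys(iterable)

    # The five abstract methods MutableSet requires:
    def __contains__(self, item):
        return item in self._dict

    def __iter__(self):
        return iter(self._dict)

    def __len__(self):
        return len(self._dict)

    def add(self, item):
        self._dict[item] = None

    def discard(self, item):
        self._dict.pop(item, None)


s = OrderedSet([3, 1, 2, 1])
s.add(5)
s.discard(1)
assert list(s) == [3, 2, 5]  # insertion order preserved
```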
I reused the tests from Python's standard library. There were a few instances of tests that didn't pass because of edge-case behavior that is not necessary to reimplement:
- support self-referencing repr
- erroring when a member's `__eq__` function would modify the set itself
- MutableSet supports Iterables as inputs, but not sequences (pretty rare..)
- Some specifics of exact equivalent type errors being thrown
- [The protocol for automatic conversion to immutable](https://docs.python.org/2/library/sets.html#protocol-for-automatic-conversion-to-immutable)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130003
Approved by: https://github.com/aorenste
Reland https://github.com/pytorch/pytorch/pull/126704
#### Fixes the problematic change to the type of `nn.Module._state_dict_hooks` made in that PR:
Instead of using `Tuple(Callable, bool)` to keep track of whether the private `_register_state_dict_hook` or the public `register_state_dict_post_hook` API was used to register the hook and toggle the behavior accordingly, I set an attribute on the Callable in the private API, which is never cleaned up.
If a callable previously registered using the private API is registered via the public API, a RuntimeError will be raised
#### Copied from previous PR description
Fixes https://github.com/pytorch/pytorch/issues/75287 and https://github.com/pytorch/pytorch/issues/117437
- `nn.Module._register_state_dict_hook` --> add public `nn.Module.register_state_dict_post_hook`
- Add a test as this API was previously untested
- `nn.Module._register_load_state_dict_pre_hook` --> add public `nn.Module.register_load_state_dict_pre_hook` (remove the `with_module` flag, default it to `True`)
~- For consistency with optimizer `load_state_dict_pre_hook` raised by @janeyx99, allow the pre-hook to return a new `state_dict`~
- For the issue raised by https://github.com/pytorch/pytorch/issues/117437 regarding the `_register_state_dict_hook` semantic where returning a new state_dict is only respected for the root module (private hook only):
- Document this for private `_register_state_dict_hook`
- Remove this for the public `register_state_dict_post_hook`
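For illustration, a small hedged sketch of the public post-hook API described above (the hook body and the extra key are made up; the hook edits the state dict in place rather than returning a new one):
```python
import torch
import torch.nn as nn

def add_version_key(module, state_dict, prefix, local_metadata):
    # Post-hooks receive the assembled state dict and may edit it in place.
    state_dict[prefix + "version"] = torch.tensor(1)

model = nn.Linear(2, 2)
model.register_state_dict_post_hook(add_version_key)
print(sorted(model.state_dict().keys()))  # ['bias', 'version', 'weight']
```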
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131690
Approved by: https://github.com/albanD
Implements the donated buffer feature and adds unit tests. A donated buffer is a saved tensor that is not aliased with forward inputs, fw_outputs (except saved tensors), or bw_outputs. We detect donated buffers during `aot_dispatch_autograd` and store them in `ViewAndMutationMetadata`, so that they can be accessed in Inductor.
Fixes #129496
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130580
Approved by: https://github.com/bdhirsh
BE task T195600898 (internal).
The 3 tests
```
test_non_contiguous_input_mm
test_non_contiguous_input_bmm
test_non_contiguous_input_addmm
```
had the following error in TestX:
```
self.assertTrue(torch.allclose(ref, act, atol=1e-2, rtol=1e-2))
AssertionError: False is not true
```
The tolerance comparing eager and compiled results is too small, perhaps because of a Triton update that changed numerics:
```
Mismatched elements: 25 / 38597376 (0.0%)
Greatest absolute difference: 0.015625 at index (3771, 509) (up to 0.01 allowed)
Greatest relative difference: 9.375 at index (13687, 48) (up to 0.01 allowed)
```
Change the absolute tolerance from 0.01 to 0.02. Also switch to use `torch.testing.assert_close` which prints out the greatest absolute/relative difference like above when the assert fails.
`test_non_contiguous_input_mm_plus_mm` has a different problem, just switching to `torch.testing.assert_close` to be uniform with the other tests.
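For illustration, a hedged sketch of the two comparison styles with the new tolerance (the drift value here is made up):
```python
import torch

ref = torch.randn(128, 128)
act = ref + 0.015  # simulated numeric drift, e.g. from a Triton update

# The old assertion only reports "False is not true" on failure:
#   self.assertTrue(torch.allclose(ref, act, atol=1e-2, rtol=1e-2))

# assert_close prints the greatest absolute/relative difference on failure,
# and the loosened atol=2e-2 matches the new tolerance.
torch.testing.assert_close(act, ref, atol=2e-2, rtol=1e-2)
```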
Test commands:
```
python test/inductor/test_max_autotune.py -k TestMaxAutotune.test_non_contiguous_input_mm
python test/inductor/test_max_autotune.py -k TestMaxAutotune.test_non_contiguous_input_addmm
python test/inductor/test_max_autotune.py -k TestMaxAutotune.test_non_contiguous_input_bmm
```
Internal stress tests pass now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131822
Approved by: https://github.com/shunting314
Summary:
Fixes https://github.com/pytorch/pytorch/issues/130379.
The original error is that the verifier finds the placeholder nodes' `meta["val"]` missing in the subgraph of the WrapSetGradEnabled HOP.
In this PR, we fix it by reordering replace_set_grad_with_hop_pass to run after the lift_constant_tensor pass, because only after that pass do all the constant attrs have `meta["val"]`.
Test Plan: buck2 test test:test_export -- -r "test_setgrad_lifted_tensor"
Differential Revision: D60244935
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131787
Approved by: https://github.com/yushangdi
This PR enables AutoHeuristic for kernel choice selection, where the feedback cannot be provided immediately when AutoHeuristic is called, but only after autotuning has happened. The steps are the following:
1. When the AutoHeuristic constructor is called, AutoHeuristic registers a function in select_algorithm.py.
2. After autotuning in select_algorithm.py has happened, and there is an entry in autoheuristic_registry, select_algorithm provides the autotuning results to AutoHeuristic, which stores them.
I enabled AutoHeuristic for mixed_mm to have an example to test it on. We probably want to add more context, and also add an augment_context function. I will add support for this in another PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131610
Approved by: https://github.com/eellison
Summary: Previously, when folding BN into conv, we rely on DCE
to clean up the unused BN node from the graph. This works if
the model is already in eval mode, but fails if the model is
still in train mode because DCE doesn't remove nodes with
potential side effects (in this case `_native_batch_norm_legit`).
This required users to move the model to eval mode before calling
convert in order to get a properly DCE'd graph.
To solve this, we manually erase the BN node after folding
instead of relying on DCE. This relaxes the ordering constraints
between `move_exported_model_to_eval` and `convert_pt2e`.
Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d.test_fold_bn_erases_bn_node
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d.test_fold_bn_erases_bn_node
Reviewers: jerryzh168, yushangdi
Subscribers: jerryzh168, yushangdi, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131651
Approved by: https://github.com/yushangdi
Summary: This is experimental work. Depending on the performance stability and benchmark coverage on A10g, we may consider using A10g for manually triggered per-PR performance comparison instead of exhausting expensive A100 instances.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131816
Approved by: https://github.com/huydhn
Summary: Pretty straightforward. ROCm 6.2.0 changed the `__hip_bfloat16` API (see [this PR](481912a1fd)), so we gate the impl on the `__BF16_HOST_DEVICE__` macro to support older and newer versions of ROCm.
Test Plan: CI
Differential Revision: D60024830
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131359
Approved by: https://github.com/houseroad
https://github.com/pytorch/pytorch/issues/105290
The problem in the original flow is that:
(1) the user calls `torch.mul(complex_tensor, complex_scalar)`
(2) python arg parser wraps the complex scalar in a `scalar_tensor`, and dispatches to `aten.mul.Tensor(self, scalar_other)`
(3) autograd sees `aten.mul.Tensor`, calls `scalar_other.conj()` [here](https://github.com/pytorch/pytorch/blob/main/torch/csrc/autograd/FunctionsManual.cpp#L597)
(4) during proxy tensor tracing, this gets dispatched to `aten._conj(scalar_tensor)`
(5) when we hit __torch_dispatch__, the scalar_tensor is converted back into a plain python scalar
(6) we error during tracing, because in `FunctionalTensorMode.__torch_dispatch__` we try to redispatch on `aten._conj.default(plain_python_scalar)`, and this overload does not accept python scalars.
My attempted fix in this PR is to update `TensorBase::conj()` to check if the current tensor is a scalar tensor (wrapped number), and if so, manually:
(1) convert the scalar tensor back into a scalar
(2) call scalar.conj() directly
(3) convert the result back into a wrapped tensor
This avoids having to go through python entirely in the tracing case (which is fine, because these scalar tensors are constants that we can const-prop during tracing anyway).
Notably, I did **not** add e.g. a new `aten._conj.Scalar` overload. This would not actually fix the problem, since the bug is that we call `aten._conj.default(python_scalar)` directly. We would also need to muck with all `__torch_dispatch__` call sites to know to convert python scalars back into tensors directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131482
Approved by: https://github.com/zou3519, https://github.com/ezyang
ghstack dependencies: #131403
Fixes https://github.com/pytorch/pytorch/issues/121353
our handle for `.data` in dynamo today basically just converts `y = x.data` into `y = x.detach()`. The semantics of these two ops are not quite the same, because:
(1) any future mutations on `x.data` will be fully ignored by autograd
(2) any mutations on `x.detach()` will bump x's version counter
the linked model does a .data mutation that is hidden from autograd in eager, but ends up erroring during AOTDispatcher tracing.
I updated dynamo's handling so that:
(1) when dynamo sees a call to `getattr(tensor, "data")` and calls `.detach()` we set a flag on the returned `TensorVariable` indicating it came from `.data`
(2) on any tensor method that we call with an input `TensorVariable` with this flag turned on, we proxy autograd's `preserve_version_counter` logic into the graph, to properly reset the VC after the op is run.
One thing to note is that I don't actually do this on every op that we pass the tensor to: I only do it for tensor methods that appear to be mutations (by checking for a trailing underscore). My thought was that:
(1) I didn't want to do this for **every** op that you pass `y` into, since that will e.g. triple the number of nodes in the graph, and could cause compile time regressions if you use .data
(2) this situation is pretty rare in general, and I'm hoping that "tensor method mutations" cover most reasonable mutation cases. If we manage to miss a case, you will get a loud error during tracing anyway, so there is not a safety issue.
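For reference, a small eager-mode illustration of the two semantics described above (the `_version` values are what current PyTorch is expected to report; treat this as a sketch rather than a spec):
```python
import torch

x = torch.ones(3, requires_grad=True)
x.detach().add_(1)   # detach() shares x's version counter
print(x._version)    # expected: 1 -- autograd can notice this mutation

y = torch.ones(3, requires_grad=True)
y.data.add_(1)       # .data mutations are hidden from autograd
print(y._version)    # expected: 0 -- the version counter is not bumped
```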
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131403
Approved by: https://github.com/anijain2305, https://github.com/zou3519
Looks like in the halide codegen refactor, the range tree codegen was
split out from initialize_range_tree into its own function, but
triton_split_scan.py wasn't updated to reflect this change.
The result was the codegen gets invoked twice which is benign but makes
the kernel harder to read.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131669
Approved by: https://github.com/Chillee
Fixes https://github.com/pytorch/pytorch/issues/130750.
Repro of lazy/eager `map` discrepancy without `islice`:
```python
def fn(a, b):
    y = 1

    def f(x):
        nonlocal y
        y += 1
        return x

    l = list(zip([a, b], map(f, [1, 2, 3, 4])))
    return a + y
```
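In eager Python, `zip` consumes `map` lazily and stops at the shorter iterable, so `f` above runs only twice and `fn` returns `a + 3`; eagerly unpacking the `map` during tracing would run `f` four times and compute `a + 5` instead. A plain-Python illustration of the laziness:
```python
calls = []
pairs = list(zip(["a", "b"], map(calls.append, [1, 2, 3, 4])))
print(pairs)   # [('a', None), ('b', None)]
print(calls)   # [1, 2] -- only two elements of the map were ever evaluated
```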
The major change is that we implement `MapVariable` and `ZipVariable` based on `IteratorVariable`. Before, `map` and `zip` were being traced by immediately unpacking the result as a `TupleVariable`, which is wrong in cases such as the example above.
`MapVariable`s are not allowed to be unpacked while `ZipVariable`s can only be unpacked if all of its iterables can also be unpacked.
We also add new `[has_]force_unpack_var_sequence` methods to `VariableTracker` for the case where it is safe to unpack the entire sequence lazily, e.g., when building a list from a map (i.e. `list(map(f, ...))`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131413
Approved by: https://github.com/anijain2305
Brian debugged the difference in output type between the inference and train graphs.
The partitioner sometimes returns a list output type.
After this PR it will always return a tuple.
There can potentially be new graphs inside tests that land between the time this PR's CI jobs finish and the time it lands.
This can be easily fixed with a fast-forward fix via:
```
EXPECTTEST_ACCEPT=1 python test/test.py
```
Adding ciflows/periodic to minimize this probability
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131759
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
Summary: Found this "cannot find -ltorch: No such file or directory" issue when collecting AOTI CPU perf for the dashboard. Debugging on the CI machine revealed two problems: 1) no valid VEC_ISA was picked; 2) when 1 happens, libtorch path is not specified in the linker path.
This PR fixes the second problem. A later PR will fix the first problem, but somehow finding the right VEC_ISA causes a performance regression, which needs more investigation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131791
Approved by: https://github.com/zou3519, https://github.com/chenyang78
Suggests fixes for data-dependent errors in non-strict export.
Any data-dependent error has an unresolved condition on unbacked symints. A mechanizable strategy for fixing such errors, which this PR enables, is to "bash" them using `torch._check()`s. For each error we suggest using `torch._check()` on the condition or its negation. The user selects and copy-pastes the suggested fix and continues.
For example, here's an existing data-dependent error message with the suffix following `<snip>...</snip>` added by this PR:
```
Could not guard on data-dependent expression Eq(u2, u1) (unhinted: Eq(u2, u1)). (Size-like symbols: u1)
<snip>...</snip>
User code:
File "test/export/test_export.py", line 1944, in forward
return r.view(items[0], items[2])
Suggested fixes (please choose one of the following):
1. torch._check(items[2] == r.shape[1])
2. torch._check(items[2] != r.shape[1])"
```
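In eager mode the suggested `torch._check()` is just a runtime assertion; under export it additionally feeds the symbolic reasoner so the condition above can be resolved. A tiny illustration of the eager behavior (tensor values made up):
```python
import torch

items = torch.tensor([3, 5, 4])
r = torch.randn(3, 4)

torch._check(int(items[2]) == r.shape[1])  # 4 == 4: passes silently

try:
    torch._check(int(items[1]) == r.shape[1])  # 5 != 4: the condition is False
except RuntimeError as e:
    print(e)  # a failing torch._check raises RuntimeError
```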
Tests in this PR illustrate this workflow, by taking common examples of data-dependent errors and bashing them until success, purely based on suggested fixes. In particular, we test this workflow on the "puzzlers" in https://www.internalfb.com/intern/anp/view/?id=5330476 (thanks @ezyang).
In terms of implementation, we focus on non-strict mode, where we can intercept torch function calls to install a handler that walks up the stack from the error, finding the closest non-torch frame and inspecting its locals for symints appearing in the error. The suggested fixes then access these symints through the local variables so that they can be (a) easily understood by the user (b) directly added to the code.
Implementing this idea in strict mode is follow-up work—we have already investigated what it would take, and decided to separate it out of this PR for reasons described next.
It's not too hard to map symints to locals in Dynamo (although it needs to happen elsewhere, i.e., intercepting torch function calls won't work). However, unfortunately this doesn't seem to be enough; the graph modules created by Dynamo when going through AOTAutograd can raise further data-dependent errors in some cases, and thus we need yet another mechanism to map symints to locals for graph modules, via captured source-level metadata and FX node walking. This latter component will require some care to build properly, or we might conclude it is altogether unnecessary and fix Dynamo instead.
Differential Revision: D56867432
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125378
Approved by: https://github.com/ezyang
Add support for transposed, non-contiguous `NestedTensor`s, where `ragged_idx > 1`, for the aten operators `sum` and `mean`. This diff enables reducing along the jagged dimension for non-contiguous `NestedTensor`s, transposed between non-batch dimensions as well as between a ragged and a non-batch dimension. For example, users can now reduce a `NestedTensor` of shape `(B, M, *, N)` along `*` or `(B, N, M, *)` along `*`.
Parametrize existing unit tests and add new unit tests verifying the accuracy of implementations on `NestedTensor`s that transpose between 2 non-batch dimensions as well as between a ragged and a non-batch dimension.
Differential Revision: [D59847927](https://our.internmc.facebook.com/intern/diff/D59847927/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131517
Approved by: https://github.com/davidberard98
Summary:
Dynamo doesn't track whether buffers are `persistent`. This led to some ugly code where we would mark buffers as always persistent when creating signatures, then later check whether the buffers were not in the state dict to infer whether they were non-persistent, and use this to fix up the signature.
This PR instead defines a utility to look up all the non-persistent buffers registered inside a module (this information is recorded in a private `_non_persistent_buffers_set` module attribute), and uses it to (a) correctly set the persistent flag on buffers when creating signatures (b) transfer this information to a Dynamo-traced graph module, which then causes non-persistent buffers to (correctly) not show up in the state dict.
Test Plan: existing tests + new case with non-persistent buffer in nested module
Differential Revision: D60224656
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131756
Approved by: https://github.com/zhxchen17, https://github.com/ydwu4
After a recent refactoring of inductor, `.users` are now associated with buffers instead of scheduler nodes.
In `debug.py`, one such usage of `.users` is not updated accordingly, and the change here fixes that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131796
Approved by: https://github.com/yf225
Persistent kernels are sometimes able to remove intermediate buffers that would
otherwise be needed for the non-persistent reduction kernel. This makes
multi kernel's codegen more complicated as it needs to drop these extra
arguments at runtime after selecting the correct kernel to run.
Instead, this PR updates the persistent kernel's `must_keep_buffers` so these
aren't dropped during codegen so both kernels have the same signature.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127724
Approved by: https://github.com/shunting314
ghstack dependencies: #131044
This makes TCPStore `wait` timeout print actually useful info instead of a generic `Socket Timeout` message on timeout.
Bonus:
* fix weirdness where `connect_timeout` only supported seconds unlike the rest of our timeouts (thus the minimum timeout was 1s)
* Fixed tests that used a 10s timeout (test_store now only takes 20s instead of 40s)
Ex:
```
DistStoreError: wait timeout after 100ms, keys: /the_key
```
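A hedged sketch of triggering that message locally (host, port, and key are arbitrary; the exception type follows the example above):
```python
from datetime import timedelta
import torch.distributed as dist

# Single-process master store with a short timeout; waiting on a key nobody
# sets should now fail with the descriptive message instead of "Socket Timeout".
store = dist.TCPStore("127.0.0.1", 29510, world_size=1, is_master=True,
                      timeout=timedelta(milliseconds=100))
try:
    store.wait(["/the_key"])
except dist.DistStoreError as e:
    print(e)  # e.g. "wait timeout after 100ms, keys: /the_key"
```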
Test plan:
```
python test/distributed/test_store.py
python test/distributed/test_c10d_gloo.py -v -k timeout
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131808
Approved by: https://github.com/kurman
Inductor would like a way to have activations that do not escape the backward graph marked as "donated", so we can re-use their memory during memory planning here: https://github.com/pytorch/pytorch/pull/130580
For this to be safe though, we need to know at runtime that autograd does not plan to retain the current autograd graph (either for another call to .backward() later, or if double backward is being used). In the linked PR, the current plan is to error when we detect this situation, and ask the user to turn off the donated buffer config (although if/once we get to the point of always delaying backward compilation to runtime, we can just wait until we know the runtime value to compile).
There isn't a way to know if the currently running backward is run with `retain_graph=True` from python - @soulitzer helped me figure out where to grab it so I added a python binding for it under `ctx.is_retain_graph()`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131038
Approved by: https://github.com/soulitzer
## What is sympy fn str arg?
It's a string such as `sqrt` which also happens to be a real sympy function (e.g. `sympy.sqrt`)
## Crash
```
torch/_inductor/sizevars.py", line 468, in symbolic_hint
expr = self.simplify(expr) # where expr is 'sqrt'
torch/_inductor/sizevars.py", line 66, in simplify
return sympy.expand(expr).xreplace(self.replacements)
sympy/core/function.py", line 2816, in expand
return sympify(e).expand(deep=deep, modulus=modulus, **hints)
AttributeError: 'function' object has no attribute 'expand'
```
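A minimal standalone reproduction of that failure mode, outside of Inductor:
```python
import sympy

# "sqrt" sympifies to the sympy.sqrt function object itself, which has no
# .expand() method, so sympy.expand() fails exactly as in the trace above.
expr = "sqrt"
try:
    sympy.expand(expr)
except AttributeError as e:
    print(e)  # 'function' object has no attribute 'expand'
```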
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131253
Approved by: https://github.com/desertfire
This PR refactors placeholders in cudagraphs to be serializable. We define a new PlaceholderInfo object which only has the necessary parts of placeholders for logging/debugging, and use that instead of `torch.fx.Node` directly. This allows us to then save PlaceholderInfo into the FXGraphCache/AOTAutogradCache later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130252
Approved by: https://github.com/eellison, https://github.com/masnesral
ghstack dependencies: #129384
Resubmit of #128979
`WeakDep`s force readers to have completed before a mutation overwrites the
buffer, but we want to allow fusions to occur for inplace mutations where the
same index is read and written.
Currently this is achieved by:
1. Identifying the buffers used by the mutating op in its `dep_closure`
2. Not creating `WeakDep`s for buffers in the `dep_closure`
3. Fixing up any bad fusions that might occur by an extra check in `can_fuse_vertical`
So we are first over-aggressive in removing `WeakDep`s, then add an ad-hoc fixup.
This PR instead emits all `WeakDep`s and adds a `fusable_weak_dep` check to
`can_fuse_vertical` which selectively allows inplace operation to fuse.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130835
Approved by: https://github.com/lezcano
Moves cudagraphs stuff into a post_compile function that I can later call when loading from AOTAutogradCache. On a cache hit, we only need to save any reasons for disabling cudagraphs along with some metadata needed to run cudagraphify. The arguments to cudagraphs_post_compile should be the set of parameters I'll need to reconstruct on a warm start.
No actual behavioral change should result from this: I'm moving the behavior into separate functions, but every operation should be the same pre and post PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129384
Approved by: https://github.com/eellison
https://github.com/pytorch/pytorch/issues/127561
Mutations of inputs in backward are emitted manually, after joint_fn tracing.
With the default partitioner logic they will be moved to the "forward" graph, as they are operations on forward inputs.
To keep those mutations in backward:
- Introduce a "subgraph" node key that can be specified with a context manager. When we do a manual `copy_` in backward on a forward input, we know that this is for backward, so we set subgraph="backward".
In the partitioner:
Introduce an optional argument `subgraph` to filter out nodes whose specified subgraph (node_subgraph) differs, so that they are not added to the subgraph being built.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129130
Approved by: https://github.com/Chillee
Summary:
This PR fixes all the places in the strict export stack where the output node's meta is not preserved correctly. However, we're getting a new error for the test we intend to fix (`buck2 run caffe2/test/quantization:test_quantization -- -r "test_re_export_preserve_handle"`):
The `get_attr` nodes have wrong metadata. There are likely more things that need to be fixed to get it working, but that's beyond the scope of this PR.
Test Plan: buck2 run caffe2/test/quantization:test_quantization -- -r "test_re_export_preserve_handle"
Differential Revision: D60198221
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131706
Approved by: https://github.com/yushangdi
In cases where the program takes in a constant, export will specialize on the constant and embed it into the graph, with the graph containing a placeholder node with no users. However, Inductor errors out further down because, in typical torch.compile usage, these constants don't show up as inputs. Since these constants are already embedded in the graph, we will just ignore these inputs while compiling with AOTI, and filter out the non-tensor inputs at runtime.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131594
Approved by: https://github.com/desertfire
Fixes https://github.com/pytorch/pytorch/issues/103602.
This PR implements the idea of "if someone creates a string and then ends up not using it, we would prefer to NOT have specialized," mentioned in the above issue. Specifically, we create a lazy variable tracker instead of a ConstantVariable when handling FORMAT_VALUE, and when the lazy variable tracker is realized (i.e. it's actually going to be used), we create a ConstantVariable; the specialization/guarding then happens at realization time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131529
Approved by: https://github.com/ezyang
Summary: Instead of embedding the user_defined TraceEntry inside of device_traces, which causes issues when some threads may not have the proper device id set, save them into an external_annotations field by using a RingBuffer<AnnotationEntry> called annotation_buffer owned by the NativeCachingAllocator.
Test Plan: CI, resnet run, and FBR model.
Differential Revision: D59703213
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130964
Approved by: https://github.com/zdevito
This PR provides the initial support for k-slicing (i.e. parallel reduction along k-dim) of CPP GEMM template. Only static shapes are supported now. When k-slicing is enabled, there would be extra temporary buffers allocated to hold the intermediate results and an extra barrier after initial GEMM compute by each thread, i.e. each thread first stores the GEMM result to temporary accumulation buffers (pointed by `local_buf_ptrs` which is an array of pointers pointing to accumulation buffers), followed by a reduction along k-slices, epilogue computes and store to the final output `Y`. In each k-slicing thread group, the reduction along k-slices and epilogue computes are conducted in parallel along M-dim. The algorithm is designed to reduce the synchronization overhead as much as possible.
The k-slicing is enabled when blocking on M and N is unable to occupy all threads. Since k-slicing doesn't always bring benefit, an extra configuration is added to enable it (disable by default). We need to identify a good heuristics in the future to enable k-slicing by default.
Performance numbers with 64x4096x64, 64x10000x64, 64x20000x64 as examples on 60-core SPR as examples. As you can see, the perf of k-slicing is only better than non-k-slicing when K is large enough.
Without k-slicing:
AUTOTUNE linear_unary(64x4096, 64x4096, 64)
cpp_packed_gemm_0 0.0108 ms 100.0%
_linear_pointwise 0.0431 ms 25.1%
AUTOTUNE linear_unary(64x10000, 64x10000, 64)
cpp_packed_gemm_0 0.0272 ms 100.0%
_linear_pointwise 0.0892 ms 30.5%
AUTOTUNE linear_unary(64x20000, 64x20000, 64)
cpp_packed_gemm_0 0.0781 ms 100.0%
_linear_pointwise 0.1693 ms 46.1%
With k-slicing:
AUTOTUNE linear_unary(64x4096, 64x4096, 64)
cpp_packed_gemm_0 0.0260 ms 100.0%
_linear_pointwise 0.0444 ms 58.5%
AUTOTUNE linear_unary(64x10000, 64x10000, 64)
cpp_packed_gemm_0 0.0275 ms 100.0%
_linear_pointwise 0.0893 ms 30.8%
AUTOTUNE linear_unary(64x20000, 64x20000, 64)
cpp_packed_gemm_0 0.0284 ms 100.0%
_linear_pointwise 0.1686 ms 16.8%
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130821
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #131024
#109581
At this point, the vanilla implementation (the default) is good.
Docs: https://docs-preview.pytorch.org/pytorch/pytorch/129905/generated/torch.optim.Adafactor.html#torch.optim.Adafactor
Specifically, the impl in this PR, which attempts to replicate the paper,
```
optim = torch.optim.Adafactor([weight])
```
is close enough to https://pytorch-optimizers.readthedocs.io/en/latest/optimizer/#pytorch_optimizer.AdaFactor
```
optim_c = AdaFactor([weight], betas=(0, 0.999), scale_parameter=False)
```
is close enough to https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor
```
optim = keras.optimizers.Adafactor(learning_rate=0.01)
```
The three results respectively for the same randomly generated weights:
```
# ours
tensor([[ 0.3807594, -0.3912092],
[ 0.0762539, 0.5377805],
[ 0.2459473, 0.4662207]])
# pytorch-optimizer
tensor([[ 0.3807592, -0.3912172],
[ 0.0762507, 0.5377818],
[ 0.2459457, 0.4662213]])
# keras
array([[ 0.38076326, -0.39121315],
[ 0.0762547 , 0.5377859 ],
[ 0.24594972, 0.46622536]], dtype=float32)
```
This gives me confidence to move forward in speeding up the implementation now that a baseline has been established. If you're curious about differences:
* keras assigns step_size (rho_t in their code) to `min(lr, 1 / sqrt(step))` whereas the OG impl uses a hardcoded 0.01 instead of lr. We do the same thing as keras, but our lr default is 0.01.
* We differ from the pytorch-optimizers default in that our default will not track momentum (thus `beta1=0`) and we do not apply parameter scaling.
<details>
Keras collab: https://colab.research.google.com/drive/1i3xF8ChL7TWKJGV_5v_5nMhXKnYmQQ06?usp=sharing
My script repro:
```
import torch
from pytorch_optimizer import AdaFactor
torch.set_printoptions(precision=7)
weight = torch.tensor([[ 0.37697506, -0.39500135],
[ 0.07246649, 0.53399765],
[ 0.24216151, 0.46243715]], dtype=torch.float32)
# bias = torch.tensor([0, 0], dtype=torch.float32)
weight.grad = torch.tensor([[-0.5940447, -0.7743838],
[-0.5940447, -0.7743838],
[-0.5940447, -0.7743838]], dtype=torch.float32)
# bias.grad = torch.tensor([-2.5027974, 1.5422692], dtype=torch.float32)
weight_c = weight.clone()
weight_c.grad = weight.grad.clone()
optim = torch.optim.Adafactor([weight])
optim.step()
print(weight)
optim_c = AdaFactor([weight_c], betas=(0, 0.999), scale_parameter=False)
optim_c.step()
print(weight_c)
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129905
Approved by: https://github.com/albanD
This extends the condition allowing a CPU scalar to be moved to specific devices.
This fixes an HPU specific error:
`torch._dynamo.exc.BackendCompilerFailed: backend='aot_hpu_training_backend' raised:
RuntimeError: Expected `value` to be on same device as `a`While executing %masked_fill : [num_users=1] = call_method[target=masked_fill](args = (%matmul, %expand_as, %tensor), kwargs = {})`
On the HPU in eager mode the problem doesn't occur because the pytorch's implementation is not used then.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127871
Approved by: https://github.com/jgong5, https://github.com/ezyang
This fixes a couple errors that come up when multi-kernel is used with
split-scan.
1. The split-scan was being marked as a persistent kernel, which allowed
a multi-kernel to be created but this isn't supported. Fix is to
never mark split-scan as persistent.
2. Benchmark codegen was not handling WorkspaceArg, and would raise a
KeyError during codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131044
Approved by: https://github.com/shunting314
Changes:
- Add `-C REPO` to the `git` commands so the tool can be run from anywhere, not only from the repo dir
- Use `pathlib.Path` as much as possible
- Replace `subprocess.run(..., check=True)` with `subprocess.check_{call,output}(...)`
- Add `encoding='utf-8'` for files
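Taken together, the refactored pattern looks roughly like the sketch below (the path handling and the particular git invocation are illustrative, not the tool's actual code):
```python
import subprocess
from pathlib import Path

REPO = Path(__file__).absolute().parent  # illustrative; the real tool derives this differently

# `git -C` pins the repo so the tool works from any cwd; check_output replaces
# subprocess.run(..., check=True), and output is decoded as UTF-8.
branch = subprocess.check_output(
    ["git", "-C", str(REPO), "rev-parse", "--abbrev-ref", "HEAD"],
    encoding="utf-8",
).strip()
print(branch)
```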
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131134
Approved by: https://github.com/ezyang
------
As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.
Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.
Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361
```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000
@classmethod
def eval(cls, base, divisor):
# python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
# Assert triggered by inequality solver
# assert base.is_integer, base
# assert divisor.is_integer, divisor
# We don't provide the same error message as in Python because SymPy
# makes it difficult to check the types.
if divisor.is_zero:
raise ZeroDivisionError("division by zero")
if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
int_oo,
-int_oo,
sympy.oo,
-sympy.oo,
):
return sympy.nan
if base is sympy.nan or divisor is sympy.nan:
return sympy.nan
if base.is_zero:
return sympy.S.Zero
if base.is_integer and divisor == 1:
return base
if base.is_integer and divisor == -1:
return sympy.Mul(base, -1)
if (
isinstance(base, sympy.Number)
and isinstance(divisor, sympy.Number)
and (
base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
)
):
r = float(base) / float(divisor)
if r == math.inf:
return int_oo
elif r == -math.inf:
return -int_oo
elif math.isnan(r):
return sympy.nan
else:
return sympy.Integer(math.floor(r))
if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
return sympy.Integer(int(base) // int(divisor))
if isinstance(base, FloorDiv):
return FloorDiv(base.args[0], base.args[1] * divisor)
# Expands (x + y) // b into x // b + y // b.
# This only works if floor is an identity, i.e. x / b is an integer.
for term in sympy.Add.make_args(base):
quotient = term / divisor
if quotient.is_integer and isinstance(divisor, sympy.Integer):
# NB: this is correct even if the divisor is not an integer, but it
# creates rational expressions that cause problems with dynamic
# shapes.
return FloorDiv(base - term, divisor) + quotient
try:
gcd = sympy.gcd(base, divisor)
if gcd != 1:
> return FloorDiv(
sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
)
base = -1.00000000000000
cls = FloorDiv
divisor = -1.00000000000000
gcd = 1.00000000000000
quotient = 1.00000000000000
term = -1.00000000000000
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}
@wraps(func)
def wrapper(*args, **kwargs):
try:
> retval = cfunc(*args, **kwargs)
E RecursionError: maximum recursion depth exceeded in comparison
E
E To execute this test, run the following from the base repo dir:
E python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
args = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func = <function Function.__new__ at 0x7fc530317280>
kwargs = {}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
**Summary**
I have added a new noise level between the existing levels of 1 and 2, such that the noise level controls are now:
0. prints module-level collective counts
1. prints dTensor operations not included in trivial operations (new noise level)
2. prints operations not included in trivial operations
3. prints all operations
This gives the user more flexibility in controlling what information they want to use. The noise levels are used both for creating the console/file log and the json dump. In the example file, I have changed the module_tracing examples to noise level 0 and have changed my transformer examples to show off the new noise level.
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing
3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing
4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131592
Approved by: https://github.com/XilunWu
ghstack dependencies: #131419, #130996
Summary:
When run internally in multiple parallel processes, the `test_debug_trace` hits the cache and skips writing all the expected outputs. Here we force-disable inductor cache to circumvent the problem. Ideally, we should switch to using a cleaner `fresh_inductor_cache` decorator approach, but it doesn't work at the moment.
Additionally, the debug trace dir is now generated by `tempfile.mkdtemp` to avoid a (rather unlikely) race condition.
Test Plan: Tested internally.
Differential Revision: D60207586
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131722
Approved by: https://github.com/eellison
Fix static `py::object`s with `py::gil_safe_call_once_and_store`.
The following code will leak a `py::object`, whose destructor will be called when the program shuts down. The destructor calls `Py_DECREF(obj.m_ptr)`, which may raise a segmentation fault.
```c++
void func() {
static py::object obj = py::module_::import("foo").attr("bar");
...
}
```
A correct workaround is to use a raw pointer rather than a static instance.
```c++
void func() {
static py::object* obj_ptr = new py::object{py::module_::import("foo").attr("bar")};
py::object obj = *obj_ptr;
...
}
```
This PR uses the `py::gil_safe_call_once_and_store` function from `pybind11`, which can run arbitrary initialization code exactly once, thread-safely, under the Python GIL.
```c++
void func() {
PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<py::object> storage;
py::object obj = storage
.call_once_and_store_result(
[]() -> py::object {
return py::module_::import("foo").attr("bar");
}
)
.get_stored();
...
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130341
Approved by: https://github.com/ezyang, https://github.com/malfet
Add a new option `--cuda` to `tools/nightly.py` to pull the nightly packages with CUDA support.
```bash
# installs pytorch-nightly with cpuonly
tools/nightly.py pull
# The following only available on Linux and Windows
# installs pytorch-nightly with latest CUDA we support
tools/nightly.py pull --cuda
# installs pytorch-nightly with CUDA 12.1
tools/nightly.py pull --cuda 12.1
```
Also add targets in the `Makefile` and instructions in the contribution guidelines.
```bash
# setup conda environment with pytorch-nightly
make setup-env
# setup conda environment with pytorch-nightly with CUDA support
make setup-env-cuda
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131133
Approved by: https://github.com/ezyang
Summary: Internally, the ABI-compatible mode is [enabled by default](eb54ca7abe/torch/_inductor/config.py (L53)). As a result, when the `abi_compatible: False` flag is not specified explicitly in tests that assume non-ABI-compatible C++ codegen, those tests fail internally. Here we fix one such test in `test_memory_planning.py`.
Test Plan: Tested internally.
Differential Revision: D60197327
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131703
Approved by: https://github.com/eellison
Bumps [setuptools](https://github.com/pypa/setuptools) from 69.5.1 to 70.0.0.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/pypa/setuptools/blob/main/NEWS.rst">setuptools's changelog</a>.</em></p>
<blockquote>
<h1>v70.0.0</h1>
<h2>Features</h2>
<ul>
<li>Emit a warning when <code>[tools.setuptools]</code> is present in <code>pyproject.toml</code> and will be ignored. -- by :user:<code>SnoopJ</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4150">#4150</a>)</li>
<li>Improved <code>AttributeError</code> error message if <code>pkg_resources.EntryPoint.require</code> is called without extras or distribution
Gracefully "do nothing" when trying to activate a <code>pkg_resources.Distribution</code> with a <code>None</code> location, rather than raising a <code>TypeError</code>
-- by :user:<code>Avasam</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4262">#4262</a>)</li>
<li>Typed the dynamically defined variables from <code>pkg_resources</code> -- by :user:<code>Avasam</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4267">#4267</a>)</li>
<li>Modernized and refactored VCS handling in package_index. (<a href="https://redirect.github.com/pypa/setuptools/issues/4332">#4332</a>)</li>
</ul>
<h2>Bugfixes</h2>
<ul>
<li>In install command, use super to call the superclass methods. Avoids race conditions when monkeypatching from _distutils_system_mod occurs late. (<a href="https://redirect.github.com/pypa/setuptools/issues/4136">#4136</a>)</li>
<li>Fix finder template for lenient editable installs of implicit nested namespaces
constructed by using <code>package_dir</code> to reorganise directory structure. (<a href="https://redirect.github.com/pypa/setuptools/issues/4278">#4278</a>)</li>
<li>Fix an error with <code>UnicodeDecodeError</code> handling in <code>pkg_resources</code> when trying to read files in UTF-8 with a fallback -- by :user:<code>Avasam</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4348">#4348</a>)</li>
</ul>
<h2>Improved Documentation</h2>
<ul>
<li>Uses RST substitution to put badges in 1 line. (<a href="https://redirect.github.com/pypa/setuptools/issues/4312">#4312</a>)</li>
</ul>
<h2>Deprecations and Removals</h2>
<ul>
<li>
<p>Further adoption of UTF-8 in <code>setuptools</code>.
This change regards mostly files produced and consumed during the build process
(e.g. metadata files, script wrappers, automatically updated config files, etc..)
Although precautions were taken to minimize disruptions, some edge cases might
be subject to backwards incompatibility.</p>
<p>Support for <code>"locale"</code> encoding is now <strong>deprecated</strong>. (<a href="https://redirect.github.com/pypa/setuptools/issues/4309">#4309</a>)</p>
</li>
<li>
<p>Remove <code>setuptools.convert_path</code> after long deprecation period.
This function was never defined by <code>setuptools</code> itself, but rather a
side-effect of an import for internal usage. (<a href="https://redirect.github.com/pypa/setuptools/issues/4322">#4322</a>)</p>
</li>
<li>
<p>Remove fallback for customisations of <code>distutils</code>' <code>build.sub_command</code> after long
deprecated period.
Users are advised to import <code>build</code> directly from <code>setuptools.command.build</code>. (<a href="https://redirect.github.com/pypa/setuptools/issues/4322">#4322</a>)</p>
</li>
<li>
<p>Removed <code>typing_extensions</code> from vendored dependencies -- by :user:<code>Avasam</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4324">#4324</a>)</p>
</li>
<li>
<p>Remove deprecated <code>setuptools.dep_util</code>.
The provided alternative is <code>setuptools.modified</code>. (<a href="https://redirect.github.com/pypa/setuptools/issues/4360">#4360</a>)</p>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="5cbf12a9b6"><code>5cbf12a</code></a> Workaround for release error in v70</li>
<li><a href="9c1bcc3417"><code>9c1bcc3</code></a> Bump version: 69.5.1 → 70.0.0</li>
<li><a href="4dc0c31644"><code>4dc0c31</code></a> Remove deprecated <code>setuptools.dep_util</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4360">#4360</a>)</li>
<li><a href="6c1ef5748d"><code>6c1ef57</code></a> Remove xfail now that test passes. Ref <a href="https://redirect.github.com/pypa/setuptools/issues/4371">#4371</a>.</li>
<li><a href="d14fa0162c"><code>d14fa01</code></a> Add all site-packages dirs when creating simulated environment for test_edita...</li>
<li><a href="6b7f7a18af"><code>6b7f7a1</code></a> Prevent <code>bin</code> folders to be taken as extern packages when vendoring (<a href="https://redirect.github.com/pypa/setuptools/issues/4370">#4370</a>)</li>
<li><a href="69141f69f8"><code>69141f6</code></a> Add doctest for vendorised bin folder</li>
<li><a href="2a53cc1200"><code>2a53cc1</code></a> Prevent 'bin' folders to be taken as extern packages</li>
<li><a href="720862807d"><code>7208628</code></a> Replace call to deprecated <code>validate_pyproject</code> command (<a href="https://redirect.github.com/pypa/setuptools/issues/4363">#4363</a>)</li>
<li><a href="96d681aa40"><code>96d681a</code></a> Remove call to deprecated validate_pyproject command</li>
<li>Additional commits viewable in <a href="https://github.com/pypa/setuptools/compare/v69.5.1...v70.0.0">compare view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/pytorch/pytorch/network/alerts).
</details>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130893
Approved by: https://github.com/kit1980
We are considering consolidating the data sources for logging and the flight recorder so that we don't build multiple paths for debugging information. Before we do any merging, we want to first ensure that the PG status is also included in the flight recorder. We can also leverage this information to validate our FR dump; because the dump is not synchronized, we might see some variance in it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131268
Approved by: https://github.com/shuqiangzhang
Summary:
When tunable ops load selected kernels from a CSV file, the validator checks the hipblaslt version defined in hipblaslt-version.h.
This PR changes the validator to fetch the hipblaslt version and revision from the hipblaslt runtime instead of the header file, as in our environment we might roll out a new version of the runtime prior to updating the header file fleet-wide.
Test Plan:
Verified generated tunableops kernel selection has the correct hipblaslt version from runtime:
```
Validator,PT_VERSION,2.5.0
Validator,ROCBLAS_VERSION,4.0.0-72e57364-dirty
Validator,HIPBLASLT_VERSION,800-bf2c3184
Validator,ROCM_VERSION,6.0.0.0-12969-1544e39
Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack-
GemmTunableOp_BFloat16_TN,tn_8192_2_3584,Gemm_Hipblaslt_TN_572,0.0240676
GemmTunableOp_BFloat16_TN,tn_7168_2_8192,Gemm_Hipblaslt_TN_482,0.0359019
GemmTunableOp_BFloat16_TN,tn_8192_2_1024,Default,0.0173723
GemmTunableOp_BFloat16_TN,tn_1280_2_8192,Gemm_Hipblaslt_TN_491,0.0191047
```
Differential Revision: D59889043
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131078
Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell
Summary:
When importing `_trace.py`, putting `torch._dynamo.exc.Unsupported` in the global variable ``_ALLOW_LIST`` can cause the import of ``export/_trace.py`` to fail with the error:
ValueError: Artifact name: 'graph_breaks' not registered, please call register_artifact('graph_breaks') in torch._logging.registrations.
The error is raised directly on the line `graph_breaks_log = torch._logging.getArtifactLogger(__name__, "graph_breaks")` in `_dynamo/exc.py`. I've checked that ``register_artifact('graph_breaks')`` does already exist in torch._logging.registrations.
Explicitly calling `import torch._logging` doesn't fix the issue.
(see T196719676)
We move ``_ALLOW_LIST`` to be a local variable.
Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test -- --exact 'aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test - test_serialized_model_for_disagg_acc (aiplatform.modelstore.publish.utils.tests.fc_transform_utils_test.PrepareSerializedModelTest)'
buck2 test 'fbcode//mode/opt' fbcode//aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test -- --exact 'aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test - test_serialized_test_dsnn_module (aiplatform.modelstore.publish.utils.tests.fc_transform_utils_test.PrepareSerializedModelTest)'
Differential Revision: D60136706
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131523
Approved by: https://github.com/zhxchen17
Regression introduced by https://github.com/pytorch/pytorch/pull/126376
Before this change, compiling torch_cpu on my MacBook prints tons of warnings every time HooksInterface is included
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/src/optim/adamw.cpp:1:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/optim/adamw.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/module.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_module_holder.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_value.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/detail/static.h:4:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/types.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/ATen.h:7:
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/Context.h:13:
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/HIPHooksInterface.h:27:11: warning: '~HIPHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
virtual ~HIPHooksInterface() = default;
^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:16:11: note: overridden virtual function is here
virtual ~AcceleratorHooksInterface() = default;
^
1 warning generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131204
Approved by: https://github.com/albanD, https://github.com/seemethere
Summary:
We removed references to _export/exported_program.py in executorch
in D60052318. Now we can remove this file.
Update the pin to executorch.
Test Plan: contbuild & OSS CI:
Differential Revision: D60072980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131597
Approved by: https://github.com/avikchaudhuri
- Add a `kwargs` option; add the `dynamic_shapes` option so users can supply it directly to `torch.export` (see the sketch after this list).
- Make the options keyword-only arguments (bc-breaking)
- Deprecate the `training` and `operator_export_type` options and include a warning message. The exact time for removal is TBD but the message should discourage users from using the options.
- Deprecate two functions `exportable_ops` and pretty print export
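As a reference for the `dynamic_shapes` format that gets forwarded to `torch.export`, here is a minimal sketch using the plain `torch.export` API; the module `M` is an illustrative stand-in and the ONNX exporter entry point itself is not shown.
```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

batch = Dim("batch")
# Mark dim 0 of the input "x" as dynamic; the same spec can be passed through
# the exporter option described above.
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": {0: batch}})
print(ep)
```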
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131501
Approved by: https://github.com/titaiwangms
`inductor` and `rocm` workflows are the major contributors to the CI load on ROCm CI at the moment, resulting in huge backlogs: https://github.com/pytorch/pytorch/pull/131489#issue-2425804464
* Move rocm.yml to cron frequency
* Move ROCm CI jobs from inductor.yml to inductor-rocm.yml
* Introduce `ciflow/inductor-rocm` as PR label to manually invoke inductor jobs for ROCm (no automatic invoking to limit CI load)
* After this PR, only `trunk` workflow jobs for ROCm will run on every commit and PR merge; since they take about 45min*3 on average, I decided to leave them as-is, as they provide us some basic insulation against ROCm breakage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131637
Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/huydhn
Summary: Since the WaitCounter frontend itself has minimal dependencies, it's fine to move it into c10. Specific backends can be registered/linked separately.
Test Plan: unit test
Reviewed By: jamesperng, asiab4, c-p-i-o
Differential Revision: D59842868
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131021
Approved by: https://github.com/asiab4
The problem was we were shoving SymInts into the constant_args side
table. The root problem is that torch.fx.node.base_types, which we use
to determine what can be put in the graph, doesn't actually have SymInt
in it. This PR fixes base_types to include SymInt.
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131363
Approved by: https://github.com/oulgen, https://github.com/justinchuby
This PR adds an API `FSDPModule.set_reduce_scatter_divide_factor` to allow setting a custom gradient divide factor for reduce-scatter. This can be useful when using parallelisms in combination with FSDP (e.g. expert parallelism), where gradients need to be divided by a custom factor (e.g. an extra `EP` factor).
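A rough usage sketch; only the `set_reduce_scatter_divide_factor` method name comes from this PR, while the mesh setup and the `ep_degree` value below are made-up placeholders, and the exact semantics of the factor are defined by the API docs.
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._composable.fsdp import fully_shard

# Assumes torch.distributed is already initialized (e.g. via torchrun).
dp_mesh = init_device_mesh("cuda", (torch.distributed.get_world_size(),))
model = torch.nn.Linear(16, 16, device="cuda")
fully_shard(model, mesh=dp_mesh)

# Hypothetical expert-parallel degree requiring an extra division of gradients.
ep_degree = 2
model.set_reduce_scatter_divide_factor(dp_mesh.size() * ep_degree)
```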
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129286
Approved by: https://github.com/weifengpy
This PR implements an opt-in configuration option for synchronizing compilation across all ranks at the end of Dynamo tracing (and potentially, other places in the future). There are two pieces to this PR:
1. Implementing infrastructure for compiler collectives (DistributedState/LocalState, the actual collective)
2. Using this infrastructure to synchronize automatic dynamic choices across all ranks
The infrastructure in part one can be used for other purposes, just add more (serializable) fields to LocalState.
Here is how automatic dynamic synchronization works:
1. Preflight in "torch/_dynamo/variables/builder.py": On the first Dynamo trace run, we trace without automatic dynamic at all; we assume all Tensor inputs that are not otherwise marked are static. This run is purely to collect all Tensor input sizes in the program.
2. torch/_dynamo/output_graph.py: At the end of the first Dynamo trace run, we perform a compiler collective to distribute all Tensor input sizes to all ranks. Then, we restart Dynamo
3. Apply the updates in "torch/_dynamo/variables/builder.py": Now that we have all sizes for every rank, we now update frame state with the observed sizes for all ranks, in rank order. Under the assumption that frame state is consistent on all ranks, this series of updates will preserve consistency.
For future work, it would be safer if we force a consistent hint on all ranks; this is more involved as we have to interpose in fakification.
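For reference, a minimal sketch of opting in; the config name below is my best guess at the knob added here and should be treated as an assumption.
```python
import torch._dynamo.config as dynamo_config

# Assumed knob name: opt in to the end-of-trace compiler collective so that
# automatic-dynamic decisions are synchronized across all ranks.
dynamo_config.enable_compiler_collectives = True
```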
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130935
Approved by: https://github.com/jansel
High level goals:
- Cover the all-gather and reduce-scatter pattern matchers with unit tests
- Make it easier to exclude certain collectives as async-tp candidates
- Make it easier to match other all-gather and reduce-scatter variants (e.g. fp8 collectives)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131409
Approved by: https://github.com/weifengpy
Resubmit of #129325
Previously each mutation was represented by a `MutationOutput` operation which
was a new scheduler node that must be scheduled immediately afterwards.
Now we have a single scheduler node, which produces multiple `MutationOutput`
buffers as its output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130832
Approved by: https://github.com/lezcano
This should prevent regressions like the ones fixed by https://github.com/pytorch/pytorch/pull/131204
- Remove global `-Wno-error=inconsistent-missing-override`
- Wrap offending includes (protobuf and asmjit) with `C10_DIAGNOSTIC_PUSH_AND_IGNORE` and `C10_DIAGNOSTIC_POP_AND_IGNORED`
- Add `override` keyword to `at::namespace::tunable::StreamTimer` and `LLVMCodeGenImpl`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131524
Approved by: https://github.com/atalman
Add example `NestedTensor`s with inner dimension of size `1` to `_get_example_tensor_lists` with `include_inner_dim_size_1=True`. This diff creates `NestedTensor`s of sizes `(B, *, 1)` and `(B, *, 5, 1)`, ensuring that the current implementations of jagged reductions for `sum` and `mean` hold for tensors of effective shape `(B, *)` and `(B, *, 5)`.
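For context, a small sketch of the kind of jagged `NestedTensor` with an inner dimension of size `1` that these example lists now cover (the shapes below are arbitrary):
```python
import torch

# Effective shape (B, *, 1): three jagged rows, each with a trailing dim of size 1.
nt = torch.nested.nested_tensor(
    [torch.randn(3, 1), torch.randn(5, 1), torch.randn(2, 1)],
    layout=torch.jagged,
)
print(nt.shape)  # e.g. torch.Size([3, j1, 1])
```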
Differential Revision: [D59846023](https://our.internmc.facebook.com/intern/diff/D59846023/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131516
Approved by: https://github.com/davidberard98
Summary: We currently don't support some of the `@triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent that restriction, to unblock internal compilation in some cases. The flag ships with docs explaining why setting it is generally not a good idea.
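A rough sketch of what opting out might look like; the exact config name below is an assumption on my part, not something confirmed in this summary.
```python
import torch._inductor.config as inductor_config

# Assumed knob name: tell Inductor to ignore @triton.autotune arguments it does
# not support instead of erroring out. Read the accompanying docs before using it.
inductor_config.unsafe_ignore_unsupported_triton_autotune_args = True
```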
Test Plan:
```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_autotune_with_unsupported_args
...
----------------------------------------------------------------------
Ran 3 tests in 3.636s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131431
Approved by: https://github.com/oulgen, https://github.com/zou3519
The issue addressed is that compiled autograd changes the calling convention of the FX graph to only have a single placeholder which contains a list of inputs. In this case, the meta of the tensor input nodes don't contain the `tensor_dict` meta. This adds them.
The context is that `tensor_dict` is used to convey if a tensor is an input with a static address.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131556
Approved by: https://github.com/anijain2305
In gen.py, the code for generating CompositeViewCopyKernels.cpp includes *_native.h headers for "view_groups" but not for "structured_native_functions". This results in the TORCH_API in those headers being ineffective and prevents such functions from being used outside libtorch_cpu.so.
This patch ensures that gen.py includes the native headers for "structured_native_functions" in the same way as for "view_groups".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131208
Approved by: https://github.com/bdhirsh
Summary: In the script for testing different families of models, when conversion fails, we now use the output of the explain function to provide more meaningful information.
Test Plan:
Manual testing with the attached log information.
```
buck2 run mode/dev-nosan sigmoid/inference/ts_migration:main -- --mode test_all --test_suites ads_merge --model_id 440779101
```
```
Processing 440779101_5455.predictor.disagg.gpu.merge
model_name: 440779101_5455.predictor.disagg.gpu.merge
has_ts_model: True
has_sample_inputs: True
ops_maybe_missing_meta: set()
ts_can_run: True
ts_run_exception: None
can_convert: False
convert_exception: Unsupported nodes are found in the following list:
0. prim::Loop [%14259 : int = prim::Loop(%14258, %1129, %1126), scope: torch.fx.graph_module.GraphModule:: # <torch_package_1>.caffe2/torch/fb/predictor/modules/tensors_to_device_module.py:100:19]
1. prim::Loop [%14326 : int = prim::Loop(%1115, %1129, %14259), scope: torch.fx.graph_module.GraphModule:: # <torch_package_1>.caffe2/torch/fb/predictor/modules/tensors_to_device_module.py:100:19]
ep_result_correct: None
ep_run_exception: None
can_package: None
package_exception: None
sigmoid_can_run: None
sigmoid_run_exception: None
sigmoid_result_correct: None
```
Reviewed By: SherlockNoMad
Differential Revision: D59971446
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131214
Approved by: https://github.com/angelayi
This PR improves the thread blocking heuristics to favor full occupancy as much as possible. Also, the "m x n" block size is made as squared as possible for better data reuse.
Take the shape M=200000, N=64, K=128 as an example, the original heuristics couldn't use up all the threads when the number of threads is large, say 60:
AUTOTUNE linear_unary(200000x128, 64x128, 64)
_linear_pointwise 0.1010 ms 100.0%
cpp_packed_gemm_0 0.8303 ms 12.2%
V0722 02:26:39.220660 302553 torch/_inductor/codegen/cpp_gemm_template.py:503] [0/0] Register blocking: GemmBlocking(block_m=32, block_n=32, block_k=32)
V0722 02:26:39.221042 302553 torch/_inductor/codegen/cpp_gemm_template.py:507] [0/0] Cache blocking: GemmBlocking(block_m=625, block_n=1, block_k=4)
V0722 02:26:39.221118 302553 torch/_inductor/codegen/cpp_gemm_template.py:509] [0/0] Thread blocking: GemmBlocking(block_m=625, block_n=1, block_k=4)
V0722 02:26:39.221252 302553 torch/_inductor/codegen/cpp_gemm_template.py:526] [0/0] Number of threads: 60, occupancy: (10, 2, 1)
After this PR:
AUTOTUNE linear_unary(200000x128, 64x128, 64)
_linear_pointwise 0.1143 ms 100.0%
cpp_packed_gemm_0 0.1228 ms 93.1%
V0722 02:29:49.261794 304201 torch/_inductor/codegen/cpp_gemm_template.py:309] [0/0] Register blocking: GemmBlocking(block_m=32, block_n=32, block_k=32)
V0722 02:29:49.262860 304201 torch/_inductor/codegen/cpp_gemm_template.py:313] [0/0] Cache blocking: GemmBlocking(block_m=64, block_n=1, block_k=8)
V0722 02:29:49.262951 304201 torch/_inductor/codegen/cpp_gemm_template.py:315] [0/0] Thread blocking: GemmBlocking(block_m=69, block_n=79, block_k=8)
V0722 02:29:49.263075 304201 torch/_inductor/codegen/cpp_gemm_template.py:332] [0/0] Number of threads: 60, occupancy: (15, 4, 1)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131024
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w
**Summary**
While trying to integrate CommDebugMode with TorchTitan, I realized that the forward_hooks were being registered even though it was in the backward pass. After investigating, I realized that it was activation checkpointing that was causing this. In order to prevent users from being confused, I edited CommDebugMode so that it could differentiate between backward pass operations and activation checkpointing operations. I have also added an example case showing that CommDebugMode is able to successfully differentiate between the backward pass and activation checkpointing. The output for the example can be seen below.
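For reference, a minimal sketch of running a model under CommDebugMode; the `model`/`inp` names are placeholders for a DTensor-parallelized module and its input, and the import path reflects the module's location at the time of this change.
```python
from torch.distributed._tensor.debug.comm_mode import CommDebugMode

comm_mode = CommDebugMode()
with comm_mode:
    out = model(inp)      # forward (and, with AC, recomputation) collectives
    out.sum().backward()  # backward-pass collectives
# Dict of collective op -> count observed during the forward/backward above.
print(comm_mode.get_comm_counts())
```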
**Test Case**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e activation_checkpointing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130996
Approved by: https://github.com/XilunWu
ghstack dependencies: #131419
**Summary**
I switched the module tracker I had been inheriting from PyTorch's all-purpose one to the one written by Sanket in the distributed tools folder. I did this because the original one messed up activation checkpointing by adding itself to the parent set in the backward_pre_hook and then again in the forward_pre_hook for activation checkpointing.
**Test Case**
pytest test/distributed/_tensor/debug/test_comm_mode_features.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131419
Approved by: https://github.com/XilunWu
```python
# NOTE [low-contention collectives]
# When a collective is overlapped with abundant compute, it makes sense to
# prioritize reducing the contention between the collective and the overlapped
# compute, even at the cost of a slightly slower collective.
#
# Common collective implementations (e.g., NCCL without user buffer
# registration) optimize for throughput with no ambient compute. However, such
# implementations may not be optimal when they are overlapped with compute:
# - These implementations typically fuse the entire collective into a single
# kernel and reserve SM resources based on the most demanding portion of the
# collective, even when a large portion of the collective does not require this
# much resource.
# - These implementations often use SM-based P2P copy as opposed to copy
# engine-based P2P copy. Copy engine-based P2P copy may not have a significant
# advantage when there's no ambient compute. However, it may significantly
# improve overall resource utilization in the presence of ambient compute.
#
# When overlapped with intensive compute (e.g., persistent matmul kernels), the
# SM-usage of a collective can lead to inefficient overlapping.
#
# Low-contention collectives achieve their goals with the following strategies:
# - Use copy engine-based copy whenever possible.
# - Break down portions of a collective with different resource requirements
# into multiple kernels. This improves the overlapping efficiency at the cost
# of additional launching overhead.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130583
Approved by: https://github.com/weifengpy
The test fails internally [T195592444](https://www.internalfb.com/intern/tasks/?t=195592444) (this is a Meta-internal link). But we don't see the failure in OSS.
It turns out that there are 2 issues:
1. `run_test('cuda')` is improperly handled since it tries to import a module named 'cuda' if cuda is available. Since the import fails, all tests in the file are skipped. This hides the failure in OSS. The failure is exposed in internal tests since the main block which runs `run_test('cuda')` is skipped sometimes.
2. Fix the real issue: incompatible inputs were being provided to `do_bench`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131402
Approved by: https://github.com/eellison
Regression introduced by https://github.com/pytorch/pytorch/pull/126376
Before this change, compiling torch_cpu on my MacBook prints tons of warnings every time HooksInterface is included
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/src/optim/adamw.cpp:1:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/optim/adamw.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/module.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_module_holder.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_value.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/detail/static.h:4:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/types.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/ATen.h:7:
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/Context.h:13:
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/HIPHooksInterface.h:27:11: warning: '~HIPHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
virtual ~HIPHooksInterface() = default;
^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:16:11: note: overridden virtual function is here
virtual ~AcceleratorHooksInterface() = default;
^
1 warning generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131204
Approved by: https://github.com/albanD, https://github.com/seemethere
Summary:
Remove operator_benchmark caffe2 build due to the removal of caffe2: 2fd75667b4
Plus, we are deleting the TARGETS file from broken benchmarks that we do not intend to maintain.
Test Plan: Sandcastle CI
Differential Revision: D60086216
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131460
Approved by: https://github.com/vmpuri
Summary:
Modify the existing `mean` operator in PyTorch, invoked by `torch.mean`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff enables PyTorch users to invoke `torch.mean` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor.
Parametrize unit tests from `sum` to verify the accuracy of the ragged reduction implementation for `torch.mean`. Add unit tests and parametrize `sum` unit tests to verify error handling for unsupported features in `NestedTensor` `torch.mean`.
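A small sketch of the newly supported reduction, assuming the ragged-dim reduction behaves like the analogous dense reduction (the exact keepdim semantics are covered by the unit tests below):
```python
import torch

# Jagged NestedTensor of effective shape (B, *, M) with B=2, M=8.
nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)
out = nt.mean(dim=1)  # reduce along the ragged dimension
print(out.shape)      # expected: torch.Size([2, 8])
```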
Test Plan:
Verify that the new unit test passes via the following command:
```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_mean
```
```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_jagged_op
```
Differential Revision: D59654668
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131132
Approved by: https://github.com/davidberard98, https://github.com/jbschlosser
# Summary
While debugging CI failures for flash_attention tests I stumbled across 2 IMAs for the split-kv variant of flash attention.
1. Illegal global memory writes while writing softmax_lse_accum. This was pinpointed to the temporary lifetime of out_accum and softmax_lse_accum. These were likely getting their refcount dropped **before** the kernel launch that used them, allowing them to potentially be reused for other allocations.
2. After debugging this, there were illegal writes in the combine kernel. I was able to pinpoint this to the writes to the reduced LSE. From my understanding it was assuming that kBlockM evenly divided the global number of rows and wasn't masking out these writes.
### History
My line of thinking for this:
We create the temporary split accum + LSE stats tensors to store the data for each split. We then launch a follow up kernel to do the reduction.
Under ordinary non-roofline memory usage the CUDA caching allocator will keep these allocations alive even though the tensors were created within a temporary scope and no longer have any live references.
On CI we often run near max memory usage. We change/add tests and suddenly we get close to the OOM threshold. The memory allocator will reap these segments and we get write-after-free errors.
After that fix I did get further past the splitkv_flash kernel and then got the following error:
``` Shell
❯ TORCH_DISABLE_ADDR2LINE=1 PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer --show-backtrace=device --tool memcheck --log-file ima.txt python ima.py
softmax_lseaccum_ptr =0x7f5ebb208a00
oaccum_ptr =0x7f5ebb208c00
softmax_lse_ptr = 0x7f5ebb208800
❯
❯ head ima.txt -n 10
========= COMPUTE-SANITIZER
========= Invalid __global__ write of size 4 bytes
========= at void pytorch_flash::flash_fwd_splitkv_combine_kernel<pytorch_flash::Flash_fwd_kernel_traits<(int)32, (int)64, (int)256, (int)4, (bool)0, (bool)0, cutlass::bfloat16_t, pytorch_flash::Flash_kernel_traits<(int)32, (int)64, (int)256, (int)4, cutlass::bfloat16_t>>, (int)16, (int)1, (bool)1>(pytorch_flash::Flash_fwd_params)+0x630
========= by thread (2,0,0) in block (0,0,0)
========= Address 0x7f5ebb208804 is out of bounds
========= and is 1 bytes after the nearest allocation at 0x7f5ebb208800 of size 4 bytes
```
Okay, I looked at the address and it looks like we are writing consecutive bytes past the softmax_lse_ptr from the combine function. I tried padding out softmax_lse to q_padded and got no more illegal memory errors on my repro:
```
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
```
Fixes https://github.com/pytorch/pytorch/issues/131240
Fixes https://github.com/pytorch/pytorch/issues/131227
Fixes https://github.com/pytorch/pytorch/issues/131221
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131277
Approved by: https://github.com/malfet
Summary:
Previously it was unclear what `_convert_input_to_fake` actually does (used in strict), and in particular how it is different from `make_fake_inputs` (used in non-strict).
This PR splits that function to work purely on user inputs, then renames it to `extract_fake_inputs` and adds a comment clarifying what it does—namely, it extracts fake inputs from a given graph module instead of "converting inputs to fake inputs" (as suggested by the current name) or "making fake inputs" (as happens in non-strict, where no tracing has taken place yet).
The remainder of that function used to also fakify params and buffers. It turns out that this part is identical to what happens in non-strict, hence we also pull `make_fake_inputs` out from `non_strict_utils` into `_trace`, merge it with another util, and make both modes call it.
Test Plan: existing tests
Differential Revision: D60084442
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131421
Approved by: https://github.com/zhxchen17
Summary:
Newer versions of the MKL library return `SPARSE_STATUS_INVALID_VALUE` when badly formed non-triangular matrices are passed to the `mkl_sparse_?_trsv`/`mkl_sparse_?_mrsv` functions. This would start aborting (badly written) tests that worked with the old version which just filled the result tensor with `-NaN` instead of returning an error status.
This changes the code to fill the result tensor with `-NaN` on `SPARSE_STATUS_INVALID_VALUE` so we get the same behavior regardless of the MKL version in use.
Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:sparse -- --run-disabled`
Differential Revision: D59542023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130382
Approved by: https://github.com/malfet
Summary:
- Log export errors to Scuba and mark them with "classified" and "unclassified"
- Classify errors by exception type (ALLOW_LIST) and a `case_name` attribute
- Add `case_name` for some exceptions.
Test Plan:
Running the code below logs a classified error to `torch_export_usage` table in Scuba.
```
import torch
from torch._export.db.case import SupportLevel
class TorchSymMin(torch.nn.Module):
    """
    torch.sym_min operator is not supported in export.
    """

    def forward(self, x):
        return x.sum() + torch.sym_min(x.size(0), 100)


example_args = (torch.randn(3, 2),)
tags = {"torch.operator"}
support_level = SupportLevel.NOT_SUPPORTED_YET
model = TorchSymMin()
torch.export.export(model, example_args)
```
Differential Revision: D59981459
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131327
Approved by: https://github.com/zhxchen17
Fixes #130284. Fixes #130653.
- Add `torch.library.register_vmap` to custom ops (a minimal usage sketch follows this list)
- Add `register_vmap` for operators in custom_op_db.
- Make `torch.autograd.Function` support kwarg-only kwargs for vmap
- test operators in op_db with `tests/test_vmap`.
- change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing.
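A minimal sketch of the new registration API, using a made-up custom op (`mylib::scale`); the op itself is purely illustrative and the vmap-rule signature shown reflects my understanding of the API.
```python
import torch

@torch.library.custom_op("mylib::scale", mutates_args=())
def scale(x: torch.Tensor, factor: float) -> torch.Tensor:
    return x * factor

@scale.register_fake
def _(x, factor):
    return torch.empty_like(x)

def scale_vmap(info, in_dims, x, factor):
    # Scaling is elementwise, so the batched input can be handled directly and
    # the batch dimension stays where it was.
    return scale(x, factor), in_dims[0]

torch.library.register_vmap("mylib::scale", scale_vmap)

xs = torch.randn(4, 3)
ys = torch.vmap(lambda t: scale(t, 2.0))(xs)
```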
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589
Approved by: https://github.com/zou3519
Fixes#126338
## Issue Summary
When torchinductor compiles the combination `functional_collective -> view.dtype -> wait`, a memory leak occurs. This happens because `view.dtype` is compiled into an out-of-place Triton kernel that copies the input data to a new tensor, even if the data hasn't completed collection via the wait operation. The tensor used by `collective` is only freed when the `wait` operation triggers the garbage collector, see [~WorkRegistry](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/Functional.cpp#L41). However, since `wait` now waits for a new tensor, the previous one is never freed. The `view.dtype` should only check the metadata instead of creating a new tensor. The current lowering is against its semantics and causes memory leaks.
See further discussion in #126338.
This kind of lowering also generates unnecessary triton kernels for `view.dtype` when it can't be fused with other operations.
## Fix
The function `aten.view.dtype` is a CPU operation that changes the metadata of its input. After discussions with @eellison and @bdhirsh, we decided to change the lowering of `aten.view.dtype` to ensure it fallback properly to the correct `aten.view.dtype` instead of generating a Triton kernel in some cases. This approach also preserves the same semantics of the view operation.
When the model calls `aten.view.dtype` with a data type whose bit width matches the input's original data type, we lower it to the newly added `DtypeView` in IR, acting like a `ReinterpretView`. When the operation can be fused, its `make_loader` is called to maintain the correct type conversion for each load instruction. When the operation can't be fused, it falls back to `aten.view.dtype` to avoid Triton kernel generation.
## Example
```python
@torch.compile
def fn(x, y):
x = x.view(torch.float16)
y = y.view(torch.float16) + 1
return x @ y
x = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16)
y = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16)
fn(x, y)
```
The output code generated before this fix is like the following.
```python
triton_poi_fused_add_view_0...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 4
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)
tl.store(out_ptr0 + (x0), tmp1, xmask)
triton_poi_fused_add_view_1...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 4
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)
tmp2 = 1.0
tmp3 = tmp1 + tmp2
tl.store(out_ptr0 + (x0), tmp3, xmask)
def call(args):
...
triton_poi_fused_view_0.run(arg0_1, buf0, 4, grid=grid(4), stream=stream0)
del arg0_1
buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
# Source Nodes: [view_1, y], Original ATen: [aten.add, aten.view]
triton_poi_fused_add_view_1.run(arg1_1, buf1, 4, grid=grid(4), stream=stream0)
del arg1_1
buf2 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
# Source Nodes: [matmul, view_1, x, y], Original ATen: [aten.add, aten.mm, aten.view]
extern_kernels.mm(buf0, buf1, out=buf2)
```
As you can see, the two `view` operations are compiled into two kernels, `triton_poi_fused_view_0` and `triton_poi_fused_add_view_1`. Both of them have a line `tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)` which does the type conversion.
The main issue is that the first `view` operation doesn't do anything to the actual data, yet it generates a Triton kernel with a new output tensor. Another small issue is that this Triton kernel can't be compiled because `bitcast=True` only supports type conversion between types of the same bit width.
The following are output code generated after this PR.
```python
triton_poi_fused_add_0...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 4
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
tmp1 = tmp0.to(tl.bfloat16).to(tl.float32)
tmp2 = 1.0
tmp3 = tmp1 + tmp2
tl.store(out_ptr0 + (x0), tmp3, xmask)
def call(args):
...
triton_poi_fused_add_0.run(arg1_1, buf0, 4, grid=grid(4), stream=stream0)
del arg1_1
buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
# Source Nodes: [matmul, y], Original ATen: [aten.add, aten.mm]
extern_kernels.mm(aten.view.dtype(arg0_1, torch.float16), buf0, out=buf1)
```
The first `view` operation has been replaced with the `aten.view.dtype` and it is directly passed as an argument. The second one is still there because it is fused with the following add operation. The invalid bitcast operation is removed too.
The following two code references are for the upcasts and downcasts. For dtype in `torch.float16, torch.bfloat16`, each load will be upcast to float32, then downcast to its original dtype to ensure values are used with the right precision.
7bda23ef84/torch/_inductor/codegen/triton.py (L1725-L1726)
7bda23ef84/torch/_inductor/codegen/triton.py (L629-L642)
Huge thanks to @eellison, @bdhirsh, @shunting314, and @desertfire .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128883
Approved by: https://github.com/eellison
Migrate all pull jobs to the new Amazon 2023 AMI runner type.
Exceptions:
- Distributed tests are still on the old AMI since they had some weird [test failures](https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963175). Will debug those separately.
- Ported over a couple trunk and slow jobs that had `sync-tag`s set with the pull jobs and so needed to be on the same AMI
Revert plan, in case something starts breaking when we run these new AMIs at a larger scale:
- If specific jobs start failing consistently, we bring those jobs back to the old AMI
- If the failure is more widespread, revert this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131250
Approved by: https://github.com/malfet, https://github.com/atalman
Summary: `test/distributed/_composable/test_replicate_with_compiler.py` torch.compiles. This change introduces a version of MultiProcessTestCase that derives from the inductor TestCase class to make sure we always get a clean cache dir.
Test Plan: `python test/distributed/_composable/test_replicate_with_compiler.py`
Differential Revision: [D59925519](https://our.internmc.facebook.com/intern/diff/D59925519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131053
Approved by: https://github.com/eellison
Now that remote caching has evolved into various parts of PT2, we want to separate triton and pt2 caching as changes to one have caused SEVs to the other.
Differential Revision: D60047752
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131345
Approved by: https://github.com/aorenste
The problem was we were shoving SymInts into the constant_args side
table. The root problem is that torch.fx.node.base_types, which we use
to determine what can be put in the graph, doesn't actually have SymInt
in it. This PR fixes base_types to include SymInt.
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131363
Approved by: https://github.com/oulgen
When pin_memory and share_memory are both set to True in _create_cpu_state_dict, the memory is pinned using cudaHostRegister but is never unpinned. So once a tensor is created and freed, and a new tensor is then created, the caching allocator allocates the same memory, which fails with the error below.
```
obj = <[RuntimeError('CUDA error: part or all of the requested memory range is already mapped\nCUDA kernel errors might be a...pile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n') raised in repr()] Tensor object at 0x7f0028a4d6c0> pg = None, device = None, _ = None
```
This PR fixes the problem by attaching a hook that unregisters the memory when the tensor is freed.
This is easily reproducible with xlformers checkpointing unit tests, and the fix has been verified with the same.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131270
Approved by: https://github.com/LucasLLC
I regretted the decision in
https://github.com/pytorch/pytorch/pull/130606. Most user
torch_dispatchs don't have enough to actually handle the HOP correctly,
so for now I'd prefer that users explicitly define the interaction
between the HOP and their torch_dispatch class.
An example is FlopCounterMode: if we allow HOPs to get passed to it, it
will ignore auto_functionalized(mm) by default but it will record flops
for mm, which is weird.
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131370
Approved by: https://github.com/ydwu4
Summary:
Inductor will aggressively try to decompose and lower ops into a smaller opset. However, sometimes this may not align with kernel coverage (or perf preference) on different backends. (e.g., Inductor will decompose Gelu into primitive ops, but certain backends already have a Gelu op.) Therefore, we need a mechanism to allow customization of the decompositions used by the trace function so that Inductor will simply pass such an op through.
Test Plan:
Reviewers:
@eellison
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131329
Approved by: https://github.com/eellison
There were some miscellaneous issues I found:
* The WrapperCodeGen subclass constructors don't accept any arguments, which doesn't mesh with how Inductor can try to construct them.
* A DeviceInterface subclass for Triton doesn't implement `triton_supported() == True`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130933
Approved by: https://github.com/eellison, https://github.com/jansel
For some non-contiguous tensors, tensor.view would trigger the following
runtime error:
"RuntimeError: view size is not compatible with input tensor’s size and stride
(at least one dimension spans across two contiguous subspaces).
Use .reshape(…) instead"
So, let's use reshape instead.
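A minimal illustration of the failure mode and the fix:
```python
import torch

x = torch.randn(4, 6).t()   # the transpose makes x non-contiguous
# x.view(-1) would raise the "view size is not compatible ..." RuntimeError here.
y = x.reshape(-1)           # reshape falls back to a copy when a true view is impossible
```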
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131302
Approved by: https://github.com/muchulee8, https://github.com/desertfire
The bug causing the correctness problem will be fixed in a future OS release. The root cause is a bug in an optimization to the MPSGraph reshape operation in macOS 14.4 that results in a correctness issue with the shapes the LSTM gradient operation has when num_layers > 2.
Addresses the silent correctness issue of #125803.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130038
Approved by: https://github.com/malfet
Summary:
This diff reverts D59561509
D59561509: [FX][export] DCE pass, check schema for node impurity (#130395) by yushangdi causes the following test failure:
Tests affected:
- [cogwheel:cogwheel_mtia_cmf_m5_shrunk_test#test_flow_with_verification](https://www.internalfb.com/intern/test/844425041436985/)
Here's the Multisect link:
https://www.internalfb.com/multisect/6533402
Here are the tasks that are relevant to this breakage:
T191383430: 10+ tests unhealthy for ads_mtia_inference
The backout may land if someone accepts it.
If this diff has been generated in error, you can Commandeer and Abandon it.
Test Plan: NA
Differential Revision: D60029318
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131341
Approved by: https://github.com/angelayi
We have seen stacktrace samples showing that a lot of compilation time is spent in exceptions raised in `OpOverloadPacket.__getattr__`. It's not entirely clear why/how this happens, but I spot-checked a few places in `_inductor.graph.py` where we previously may have been calling `hasattr(OpOverloadPacket, ...)` that can be avoided (hasattr goes through getattr, which, for OpOverloadPacket, does a lookup in the dispatch table for all overload names of the packet).
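One way to avoid the pattern (a sketch of the idea, not necessarily the exact change made here):
```python
import torch

pkt = torch.ops.aten.add
# hasattr() funnels through OpOverloadPacket.__getattr__, which probes the dispatch
# table and raises AttributeError on a miss; checking the overload list avoids that.
if "out" in pkt.overloads():
    add_out = pkt.out
```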
Test Plan: CI
Differential Revision: D60048270
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131348
Approved by: https://github.com/davidberard98
SequenceParallel style assumes the input torch.Tensor is ALREADY sharded on
the sequence dimension if a DTensor is not passed in. Since this caused some
user confusion with the documentation, this PR:
1. for the case where the input passed in is already a DTensor, checks the
input placements and redistributes if it's not sharded on the sequence
dimension
2. updates the doc to make it more explicit about the cases where the user
passes in a torch.Tensor vs. a DTensor
This would fix https://github.com/pytorch/pytorch/issues/129355
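A short sketch of the usage this clarifies; the nn.Sequential model, the "0" plan key, and the mesh shape are illustrative only.
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import SequenceParallel, parallelize_module

# Assumes torch.distributed is already initialized (e.g. via torchrun).
tp_mesh = init_device_mesh("cuda", (torch.distributed.get_world_size(),))
model = torch.nn.Sequential(torch.nn.LayerNorm(16)).cuda()

# SequenceParallel expects a plain torch.Tensor input to ALREADY be sharded on the
# sequence dim; a DTensor input is now redistributed to the sequence dim if needed.
parallelize_module(model, tp_mesh, {"0": SequenceParallel()})
```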
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131346
Approved by: https://github.com/awgu
This PR re-implements pin memory aiming to get rid of the optional `device` argument and makes all related APIs to be device-agnostic. We add two new abstract APIs in [AcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/detail/AcceleratorHooksInterface.h#L12) and redefine pin memory as: "Pin memory is always pinned for the current accelerator device". In detail, it uses [getAcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/Context.h#L61) in pin_memory/is_pinned to get an appropriate device and invoke the corresponding overridden interfaces, instead of using BackendSelect and then dispatching to CUDA or other specific backends' implement methods.
Note: For new backends who want to implement and use pin memory, just inherit AcceleratorHooksInterface and overwrite the `isPinnedPtr` and `getPinnedMemoryAllocator` methods.
Additional context: To avoid BC-breaking, this PR just preserves the `device` arg of related APIs and would throw a deprecation warning if `device` arg is passed. Another PR will be submitted to update all PT callers (`Tensor.is_pinned()`, `Tensor.pin_memory()`...) not to pass this arg based on this PR. In future, `device` arg will be actually removed.
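A quick sketch of the device-agnostic usage after this change:
```python
import torch

x = torch.randn(1024)
x_pinned = x.pin_memory()   # pinned for the current accelerator; no device arg needed
assert x_pinned.is_pinned()
# Passing a device (e.g. x.pin_memory(device="cuda")) still works for now but emits a
# deprecation warning, per the note above.
```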
Relates #124908
Relates #14560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126376
Approved by: https://github.com/albanD
Fixes #130284. Fixes #130653.
- Add `torch.library.register_vmap` to custom ops
- Add `register_vmap` for operators in custom_op_db.
- Make `torch.autograd.Function` support kwarg-only kwargs for vmap
- test operators in op_db with `tests/test_vmap`.
- change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589
Approved by: https://github.com/zou3519
This is an updated PR to equip cond with the autograd feature and replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007)
@ydwu4 I tried to incorporate your requests already.
Currently there are two problems that I struggle with solving:
1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](8a704035c9/torch/__init__.py (L1914-L1916)). Therefore, I had to comment those lines out, which resolved the import issues, but I believe cond is not properly exposed as torch.cond.
2. I am not entirely sure how to deal with the opinfo test in `hop_db.py`
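For context, a minimal sketch of `cond` with gradients, i.e. the feature this PR enables, assuming `cond` ends up exposed as `torch.cond` (see item 1 above; otherwise it can be imported from `torch._higher_order_ops`):
```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

x = torch.randn(4, requires_grad=True)
out = torch.cond(x.sum() > 0, true_fn, false_fn, (x,))
out.sum().backward()   # autograd through cond
print(x.grad)
```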
Co-authored-by: Yidi Wu <yidi@meta.com>
Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911
Approved by: https://github.com/ydwu4
test_public_bindings should be run on anything that changes the public API. We need to figure out in the future what exactly is part of the public API; currently I'm using anything in torch/.
flex_attention should be run on anything involving autograd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130397
Approved by: https://github.com/malfet
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2388
We can enable accuracy checks at Diff time since it is not a performance metric.
* Refactor the existing diff time test to use the new PT2 Benchmark Runner.
* Deprecate the speedup tests and enable the accuracy tests only. We rely on ServiceLab to perform performance testing and regression detection.
Test Plan:
Sandcastle CI
Or buck test command:
```
buck2 test 'fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- test_training_resnet50_accuracy
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1688850102375429
Reviewed By: oulgen
Differential Revision: D59825601
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131266
Approved by: https://github.com/oulgen
Summary:
It is a long-known pain point that if other users are running things, the call to `torch.cuda.memory.list_gpu_processes()` will error out:
```
torch.cuda.memory.list_gpu_processes()
File "torch/cuda/memory.py", line 647, in list_gpu_processes
procs = amdsmi.amdsmi_get_gpu_process_list(handle) # type: ignore[attr-defined]
File "amdsmi/py_interface/amdsmi_interface.py", line 1946, in amdsmi_get_gpu_process_list
_check_res(
File "amdsmi/py_interface/amdsmi_interface.py", line 510, in _check_res
raise AmdSmiLibraryException(ret_code)
amdsmi.py_interface.amdsmi_exception.AmdSmiLibraryException: Error code:
10 | AMDSMI_STATUS_NO_PERM - Permission Denied
```
So just catch this error
Test Plan: torch.cuda.memory.list_gpu_processes() no longer fails
Differential Revision: D59901053
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131018
Approved by: https://github.com/eqy, https://github.com/clee2000
Also bold certain text in the error message as suggested
<img width="3000" alt="Screenshot 2024-07-19 at 5 56 48 PM" src="https://github.com/user-attachments/assets/378f20c5-c6b2-4e53-8eaf-0bd26c3a6b60">
With a GLOBAL like `os.execv` the error message is now as such
```python
File "/data/users/mg1998/pytorch/torch/serialization.py", line 1256, in load
raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Trying to load unsupported GLOBAL posix.execv whose module posix is blocked.
Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131259
Approved by: https://github.com/malfet, https://github.com/albanD
test_public_bindings should be run on anything that changes the public API. We need to figure out in the future what exactly is part of the public API; currently I'm using anything in torch/.
flex_attention should be run on anything involving autograd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130397
Approved by: https://github.com/malfet
Adds support for SymInts in the FakeTensor cache.
A couple notes:
1. When a SymInt is present in the input key for a FakeTensor operation we cache on the ShapeEnv instead of using the FakeTensorMode cache. This is necessary so we don't have to remember and check the guards. It reduces the cache hits but there's diminishing return on how much work we can do before the cache becomes more of a burden than a gain.
2. We need to be careful that when we cache an output SymInt that is a direct copy from the input that when we have a cache-hit we copy the SymNode from the input to the output. This is important because the fx-graph building code actually uses SymNode ids in the process of building the graph so constructing a same-content-but-different-id SymNode will fail.
3. In the cache key we store SymInts as a _PySymInputStub. These represent SymInt (and friends) but support `__hash__` and `__eq__` (which SymInt does not); a rough sketch of this idea follows this list.
4. In the cache entry we store SymInts as a _SymIntOutputStub.
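The sketch below is illustrative only; the real `_PySymInputStub` lives in the FakeTensor cache code and is more involved, but the core idea is wrapping a SymInt in something hashable.
```python
import torch

class HashableSymIntKey:
    """Wrap a SymInt so it can participate in a dict-based cache key."""

    def __init__(self, symint: torch.SymInt) -> None:
        # Key on the underlying SymNode identity rather than the (unhashable) value.
        self._node = symint.node

    def __hash__(self) -> int:
        return hash(id(self._node))

    def __eq__(self, other: object) -> bool:
        return isinstance(other, HashableSymIntKey) and self._node is other._node
```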
Perf example:
```
python benchmarks/dynamo/timm_models.py --ci --accuracy --timing
--explain --inductor --dynamic-shapes --dynamic-batch-only --device cuda
--training --amp --total-partitions 2 --partition-id 0 --output
/tmp/training_timm_models.csv --filter crossvit_9_240
```
fake tensor cache before:
```
INFO: FakeTensor cache stats:
INFO: cache_hits: 68137
INFO: cache_misses: 837
INFO: cache_bypasses:
INFO: symbolic shape: 48224
INFO: CompositeImplicitAutograd: 917
INFO: non-fake tensor: 70
INFO: non-FakeTensor output: 62
INFO: non-builtin: 8
INFO: dynamic output shape: 1
```
and after:
```
INFO: FakeTensor cache stats:
INFO: cache_hits: 88187
INFO: cache_misses: 14233
INFO: cache_bypasses:
INFO: CompositeImplicitAutograd: 1037
INFO: non-FakeTensor output: 602
INFO: non-fake tensor: 70
INFO: unsafe view: 36
INFO: non-builtin: 8
INFO: dynamic output shape: 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127596
Approved by: https://github.com/eellison
ghstack dependencies: #131014, #129780
This is part of #127596, pulled out to make reviewing a little easier.
Flatten the FakeTensor cache key - so it's a list of singular elements and pointing at one requires a single index rather than a PyTree path. This is used in the next PR to allow us to have the cache entry refer to an input SymInt that it needs to copy directly into the output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129780
Approved by: https://github.com/oulgen, https://github.com/eellison
ghstack dependencies: #131014
Python 3.10 adds `@dataclass(slots=True)` to auto-build the `__slots__` for a dataclass. This is really useful but we can't use it until 3.10 becomes our minimum version.
Copied the code for that functionality from CPython into a new decorator and ported it to use 3.8 syntax (removed the use of `match`).
Usage:
```
@dataclass_slots
@dataclass
class X:
pass
```
is the same as (in py3.10):
```
@dataclass(slots=True)
class X:
pass
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131014
Approved by: https://github.com/oulgen, https://github.com/eellison
Summary: The JK disables dynamo by passing None to set_eval_frame.
Test Plan:
Ran buck test mode/opt caffe2/test/dynamo:test_dynamo
Buck UI: https://www.internalfb.com/buck2/1fec33b4-c95a-4bdf-b47b-7c0b8ab9e24a
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2814750010105363
Network: Up: 0B Down: 0B
Jobs completed: 9596. Time elapsed: 28:54.5s.
Tests finished: Pass 4796. Fail 0. Fatal 0. Skip 17. Build failure 0
Also manually wrote a small local test with torch.compile and toggled the knob to see if PT2 can be disabled. Validated by running the test and observing the log.
PT2 enabled: P1486847242. Can see dynamo log about graph breaks.
PT2 disabled: P1486847727. No dynamo log. The newly added warning printed.
Reviewed By: ezyang
Differential Revision: D59968925
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131258
Approved by: https://github.com/c00w
1) Skip undefined tensors in the CPU fallback when calling _copy_from_and_resize;
2) Modify the to_cpu function to support optional tensors;
3) Copy back to the original optional tensor when alias_info isWrite is true.
@ezyang @bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130237
Approved by: https://github.com/ezyang
**Summary**
Fixed an issue with updating the current module when transitioning from a child module back to its parent module, and in the backward pass. The first issue is caused by the prehook not being called again when we go back to the parent module, and by the hook being a register_module_forward_hook, which runs before the register_module_hook used in redistribute, causing the collective call to be assigned to the incorrect module. To fix this, I updated the current module to be the parent module in a register_forward_hook in the module tracker. The second issue was caused by the parent set in the module tracker I inherit from being incorrect. I fixed this by saving the parents of each module and using them in the collective counter instead of the incorrect set. I have updated the example in module_operation_tracing to reflect the correct output. In addition, I changed the test cases that used the incompatible old CommDebugMode.
**Test Case**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing
2. pytest test/distributed/_tensor/debug/test_comm_mode_features.py -s -k test_transformer_module_tracing
3. python test/distributed/_composable/fsdp/test_fully_shard_training.py -k TestFullyShardGradientAccumulation.test_gradient_accumulation
4. python test/distributed/_tensor/test_math_ops.py -k DistMathOpsTest.test_layer_norm_bwd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130995
Approved by: https://github.com/XilunWu
ghstack dependencies: #130410
…with large index
Fixes#130806
When an output size of 2147483648 (=131072*16384) is expected in the above issue, it threw the following error:
RuntimeError: HIP error: invalid configuration argument
What happened was that the second parameter passed to hipLaunchKernel was an invalid {2147483648,1,1}.
Found two issues in the Indexing.cu:
1: ptrdiff_t was used, but it is a signed int; outTotalSize >= 2147483648 can cause an overflow when doing [this](39493aa934/aten/src/ATen/native/cuda/Indexing.cu (L1367)):
2: On ROCm, std::min -> ::min did not work as expected when outTotalSize >= 2147483648.
As a result, 2147483648 was sent to hipLaunchKernel, which the GPU does not support since this number specifies the number of threads per block. The original code intended to set 128 threads per block (though this is debatable, as the perf would not be good for the latest powerful GPUs; a TODO item to update for perf maybe?), but at least it would not cause the `invalid configuration argument` error.
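A repro sketch mirroring the shapes above (hypothetical, but it matches the linked issue's sizes; it needs a GPU with roughly 8 GB free for the fp32 output):
```python
import torch

emb = torch.nn.Embedding(16384, 16384, device="cuda")
idx = torch.randint(0, 16384, (131072,), device="cuda")
out = emb(idx)                 # 131072 * 16384 = 2147483648 output elements
print(out.dim(), out.numel())  # 2 2147483648
```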
[Test]
Run the same code snippet in the [issue](https://github.com/pytorch/pytorch/issues/130806), and print the output, its dim and numel(), which looks like below now:
```
output=tensor([[ 0.4044, -0.0244, -0.6865, ..., -0.7800, 0.1175, 1.6726],
[-1.0866, -0.1609, 0.3538, ..., 1.9105, 0.7882, 1.1583],
[-2.2079, 0.3736, 0.3610, ..., -0.2658, -0.0459, 1.3077],
...,
[ 0.8753, -0.7482, -0.1978, ..., 0.9016, 1.1501, -0.5178],
[-1.5845, -0.6277, 1.4520, ..., 0.5733, -2.1198, -0.0915],
[-0.6310, -1.0239, -0.1910, ..., 0.4309, 0.1630, 0.3239]],
device='cuda:0'), dim=2, numel=2147483648
```
Added a large tensor unit test too.
```
/pytorch# pytest test/nn/test_embedding.py -k test_large_tensors
================================================================================== test session starts ===================================================================================
platform linux -- Python 3.9.19, pytest-7.3.2, pluggy-1.4.0
rootdir: /dockerx/development/pytorch
configfile: pytest.ini
plugins: flakefinder-1.1.0, rerunfailures-14.0, xdist-3.3.1, xdoctest-1.1.0, cpp-2.3.0, hypothesis-5.35.1
collected 288 items / 287 deselected / 1 selected
Running 1 items in this shard
test/nn/test_embedding.py . [100%]
=========================================================================== 1 passed, 287 deselected in 3.16s ============================================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130994
Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell
## Description
For the single-thread case, this PR improves the cache blocking in the CPP GEMM template using CPU info (the L1 and L2 cache sizes). `Mc_blocks` and `Kc_blocks` are calculated based on the conditions below:
- size_of_B < L1
- size_of_A < 0.5 * L2
For multi-thread, we need to tune the task decomposition among threads together with cache blocking. We disabled the cache blocking change for now and will submit a follow-up PR for multi-thread optimizations.
## Performance
No regressions. Models with > 3% performance speedup are listed below:
### BF16 single thread (measured on CPU with AMX support)
- static shape
| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | detectron2_fasterrcnn_r_101_dc5 | 4% |
- dynamic shape
| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | detectron2_fasterrcnn_r_101_dc5 | 4% |
### FP32 single thread (measured on Ice Lake)
- static shape
| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | basic_gnn_edgecnn | 10% |
- dynamic shape
| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | basic_gnn_edgecnn | 10% |
### Next step
The E2E level improvement is limited due to the below reasons:
- For several HF models, we can observe a performance improvement at the kernel level for the gemm template kernel, but since the performance is either still worse than the ATen kernel (and thus won't be selected during autotune) or only improved from worse than ATen to similar to ATen, we don't see an E2E-level performance change.
- There are models where the gemm template kernel gets a > 10% performance improvement with this PR, but since the kernel time is only about 3% of the E2E time, we don't observe a significant E2E-level improvement.
We will continue to find possible optimizations in the gemm template kernel in follow-up PRs.
Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129348
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #130675, #130690
Currently we require `n % register_block_n == 0`, which typically brings good perf when `n` is a multiple of 8, 16, 32, etc., but falls back to the reference micro-gemm otherwise (where `register_block_n == 1`). This PR optimizes this by padding `n` to the next multiple of `register_block_n` (8, 16, 32, etc.) for the packed weight, so the micro-gemm can work as-is on the padded `n`. When the weight is padded, we use a local accumulation buffer to collect the micro-gemm result and then un-pad (slice) it before storing back to the output buffer.
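A minimal sketch of the padding idea, using a plain matmul as a stand-in for the micro-gemm; the helper and its arguments are illustrative assumptions:
```python
import torch

def padded_matmul(A, B, register_block_n=16):
    # Pad N up to a multiple of the register block so the blocked kernel can run as-is.
    N = B.shape[1]
    n_padded = (N + register_block_n - 1) // register_block_n * register_block_n
    B_padded = torch.nn.functional.pad(B, (0, n_padded - N))  # pad the (packed) weight once
    acc = A @ B_padded                                        # local accumulation buffer on padded N
    return acc[:, :N]                                         # un-pad (slice) before storing the output

A, B = torch.randn(512, 768), torch.randn(768, 3073)
print(torch.allclose(padded_matmul(A, B), A @ B, atol=1e-4))  # True
```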
Performance numbers measured on "Intel (R) Xeon (R) CPU Max 9480", single core, bf16.
Before
AUTOTUNE linear_unary(512x768, 3073x768, 3073)
_linear_pointwise 2.3563 ms 100.0%
cpp_packed_gemm_0 710.5902 ms 0.3%
After
AUTOTUNE linear_unary(512x768, 3073x768, 3073)
cpp_packed_gemm_0 1.8909 ms 100.0%
_linear_pointwise 2.1016 ms 90.0%
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130690
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #130675
0.12.0 Major Updates:
- Add context manager to temporarily set the dictionary sorting mode
- Add accessor APIs
- Use `stable` tag for `pybind11` for Python 3.13 support
- Fix potential segmentation fault for pickling support
0.12.1 Updates:
- Fix warning regression during import when launched with strict warning filters
Closes #130155
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139
Approved by: https://github.com/zou3519
ghstack dependencies: #130895
Fixed TrainingIRToRunDecomp failures for test_tensor_attribute_zero_args and also a few re-traceability failures, because run_decomposition does a retracing.
**edit:** also remove the eliminate_dead_code() in _unlift because of one onnx test failure:
a constant tensor attr was lifted as a constant_tensor input but it's not used in the graph after aot_autograd due to a shortcut in its decomposition. This causes the setattr to be removed by eliminate_dead_code, but the graph signature still contains the name of that buffer, which causes an inconsistency between the transformed graph and the ep's original signature after _unlift. And it seems that this has happened a few times, where some nodes are accidentally removed and we end up in an inconsistent state.
The alternative to removing it would be: every time we call eliminate_dead_code, verify the consistency of the graph against 1. the graph before the transformation and 2. all the metadata, but I think this deserves a complete design.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130990
Approved by: https://github.com/pianpwk
Resubmit of #129325
Previously each mutation was represented by a `MutationOutput` operation which
was a new scheduler node that must be scheduled immediately afterwards.
Now we have a single scheduler node, which produces multiple `MutationOutput` buffers as its output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130832
Approved by: https://github.com/lezcano
ghstack dependencies: #130831
------
The opposite of #130836. Pin `sympy >= 1.13.0` for Python >= 3.9 and `sympy == 1.12.1` for Python 3.8.
- #130836
See the PR description of #130836 for more details.
`sympy` 1.13.0 introduces some breaking changes which break our tests. More specifically:
- Ref [Backwards compatibility breaks and deprecations](https://github.com/sympy/sympy/wiki/release-notes-for-1.13.0#backwards-compatibility-breaks-and-deprecations)
> BREAKING CHANGE: Float and Integer/Rational no longer compare equal with a == b. From now on Float(2.0) != Integer(2). Previously expressions involving Float would compare unequal e.g. x*2.0 != x*2 but an individual Float would compare equal to an Integer. In SymPy 1.7 a Float will always compare unequal to an Integer even if they have the same "value". Use sympy.numbers.int_valued(number) to test if a number is a concrete number with no decimal part. ([#25614](https://github.com/sympy/sympy/pull/25614) by [@smichr](https://github.com/smichr))
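A minimal sketch of the quoted behavior change (assuming sympy is installed; the printed value flips between 1.12.x and 1.13):
```python
import sympy

# On sympy 1.12.x this prints True; on sympy >= 1.13 it prints False,
# which is the breaking change quoted above.
print(sympy.Float(2.0) == sympy.Integer(2))
```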
`sympy >= 1.13.0` is required to enable Python 3.13 support. This should be part of #130689.
- #130689
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130895
Approved by: https://github.com/ezyang
There's no reason to ban them for vmap or jvp, because without the
{grad, vjp} transforms those just act above PyTorch autograd, which will
end up saving regular Tensors.
Test Plan:
- some tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131191
Approved by: https://github.com/drisspg
Summary: Since the WaitCounter frontend itself has minimal dependencies, it's fine to move it into c10. Specific backends can be registered/linked separately.
Test Plan: unit test
Reviewed By: jamesperng, asiab4, c-p-i-o
Differential Revision: D59842868
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131021
Approved by: https://github.com/asiab4
All the changes brought by the original PR have been addressed in alternative ways in the stack. Why the original PR has to be reverted requires more effort because there is some bad interaction with export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131058
Approved by: https://github.com/williamwen42
#### Issue
Model parameters sometimes do not appear in the `named_parameters()` function, for example when trying to jit.trace an already jit.scripted model. This PR fixes that by relying on `state_dict` to get both parameters (`requires_grad=True`) and buffers.
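A minimal sketch of why `state_dict` is the more complete source here (a toy module, not the converter's actual code): `named_parameters()` only yields parameters, while `state_dict()` covers both parameters and buffers by FQN.
```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(2, 2)
        self.register_buffer("scale", torch.ones(2))

    def forward(self, x):
        return self.linear(x) * self.scale

m = M()
print([name for name, _ in m.named_parameters()])  # ['linear.weight', 'linear.bias']
print(list(m.state_dict().keys()))                 # ['linear.weight', 'linear.bias', 'scale']
```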
#### Test Plan
* `pytest test/export/test_converter.py -s -k test_convert_retrace_nested_scripted_modules`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129787
Approved by: https://github.com/angelayi
Summary:
In the export workflow, we always have a lifted graph which doesn't fetch constants through get_attr nodes. This causes a compatibility issue when we're trying to use inductor's split_const_gm function with a lifted graph.
This diff makes an additive change to split_const_gm's interface such that, when the pass sees a placeholder node present in the lifted_constants table, it also uses it as a source of constness.
This change won't break the existing code, and the lifted_constants table can be used orthogonally to the existing const folding mechanisms.
Also, as requested by the MTIA team, we introduce a small callback function used to skip certain nodes during const folding.
For the internal followup counterpart, see D59685145
Test Plan: buck run mode/opt caffe2/test:test_export -- -r split_const_gm
Differential Revision: D59692790
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130743
Approved by: https://github.com/desertfire, https://github.com/SherlockNoMad
Fix an example: resolve a broadcasting error in the attn_bias and attn_mask addition, and fix device assignment for newly created variables in the method.
1. `attn_bias += attn_mask` would cause a broadcasting error: because the shape of `attn_bias` is (L, S), the in-place result is also expected to have shape (L, S), but when the input has shape (N, num_heads, L, S), broadcasting is triggered and the result has shape (N, num_heads, L, S), which cannot be written back.
2. `attn_bias` is a newly created variable in the method and is not assigned a device.
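A minimal sketch of the two fixes, with illustrative shapes (not the documentation example's exact code):
```python
import torch

N, num_heads, L, S, E = 2, 4, 3, 5, 8
query = torch.randn(N, num_heads, L, E)
attn_mask = torch.zeros(N, num_heads, L, S, dtype=query.dtype)

# Fix 2: create the bias on the same device (and dtype) as the inputs.
attn_bias = torch.zeros(L, S, dtype=query.dtype, device=query.device)

# Fix 1: use an out-of-place add so the result broadcasts to (N, num_heads, L, S).
attn_bias = attn_bias + attn_mask
print(attn_bias.shape)  # torch.Size([2, 4, 3, 5])

# An in-place `attn_bias += attn_mask` would raise instead, because a
# (2, 4, 3, 5) result cannot be written back into the (3, 5) tensor.
```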
**This is my retry of #130200 .** I used a wrong account in that pr.
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130209
Approved by: https://github.com/mikaylagawarecki
Summary:
`fr_trace.py` is used to analyze flight recorder dump files.
This script was taken from @wconstab and @zdevito.
The only changes made were minor ones to keep the linter happy and to add a few new fields that I introduced in version `2.2` of the collector portions.
Test Plan:
Tested manually on some flight recorder data and it seems to run.
TODO:
Address 15 odd `#type: ignore` that I put in there to make the linter happy for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130764
Approved by: https://github.com/fduwjj
Earlier, the signature of the dequantize ops for decomposed quantized Tensors was changed for wider use cases where the output dtype can differ from torch.float and needs to be passed during dequantization.
Please refer: https://github.com/pytorch/pytorch/pull/121450
However, setting of correct output dtype for dequantize ops was still missing in convert_pt2e flow.
This change enables the users to use PT2E quantization flow with non torch.float unquantized dtype, such as torch.bfloat16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128953
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
The current code assumes that indirect variables will be created by the same `IndexPropagation` instance; however, that isn't true in the case of masked sub-blocks, where we take in variables from the parent block.
This fixes the issue by moving the var range information up to the
`LoopBody` object where it can be shared by all sub-blocks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130984
Approved by: https://github.com/lezcano
Regular update.
1. 90 new ATen operators and their variants are supported for XPU.
2. Bug fixes: a. fixed an out-of-bound memory access in the index_put kernel; b. fixed a debug build error.
3. Binary change: split the device AOT code of SYCL kernels into multiple libraries to avoid linkage failures.
4. torch-xpu-ops test case enhancement: a. hook the PyTorch testing op_db to align opInfo configuration with CUDA; b. hook _check_arg_device2 and freeze_rng_state to make XPU happy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131015
Approved by: https://github.com/EikanWang
Speedup bias-add compute by moving it to the epilogue. Performance numbers measured on "Intel (R) Xeon (R) CPU Max 9480", single core, bf16.
Before
AUTOTUNE linear_unary(512x768, 3072x768, 3072)
cpp_packed_gemm_0 1.9200 ms 100.0%
_linear_pointwise 1.9345 ms 99.3%
After
AUTOTUNE linear_unary(512x768, 3072x768, 3072)
cpp_packed_gemm_0 1.8321 ms 100.0%
_linear_pointwise 1.9246 ms 95.2%
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130675
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
beartype has served us well in identifying type errors and ensuring we call internal functions with the correct arguments (thanks!). However, the value of having beartype is diminished because of the following:
1. When beartype improved support for better Dict[] type checking, it discovered typing mistakes in some functions that were previously uncaught. This caused the exporter to fail with newer versions of beartype when it used to succeed. Since we cannot fix PyTorch and release a new version just because of this, it creates confusion for users of torch.onnx who happen to have beartype in their environment.
2. beartype adds an additional call line in the traceback, which makes the already thick dynamo stack even larger, affecting readability when users diagnose errors with the traceback.
3. Since the typing annotations need to be evaluated, we cannot use new syntaxes like `|` because we need to maintain compatibility with Python 3.8. We don't want to wait for PyTorch to take py310 as the lowest supported Python version before using the new typing syntaxes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130484
Approved by: https://github.com/titaiwangms
While for optimizations like pad_mm there are always only two possible choices, for other decision procedures, like kernel choice selection, the set of available choices depends on the input. Instead of storing the choices as metadata, we can take a look at all choices for which we have collected data (i.e. `df[CHOICE_COL].unique()`).
In this PR, I also try to replace "choice" and "feedback" with global constants CHOICE_COL and FEEDBACK_COL.
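A minimal sketch of reading the available choices from the collected data; the column names and rows here are illustrative assumptions, not the module's actual constants:
```python
import pandas as pd

CHOICE_COL, FEEDBACK_COL = "choice", "feedback"  # hypothetical values of the global constants
df = pd.DataFrame(
    {CHOICE_COL: ["triton", "aten", "triton"], FEEDBACK_COL: [1.2, 0.9, 1.1]}
)
print(df[CHOICE_COL].unique())  # ['triton' 'aten'] -> choices seen in the collected data
```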
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130304
Approved by: https://github.com/eellison
This is an updated PR to equip cond with the autograd feature and replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007)
@ydwu4 I tried to incorporate your requests already.
Currently there are two problems that I am struggling to solve:
1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](8a704035c9/torch/__init__.py (L1914-L1916)). Therefore, I had to comment out those lines, which resolved the import issues, but I believe cond is not properly exposed as torch.cond.
2. I am not entirely sure how to deal with the opinfo test in `hop_db.py`
Co-authored-by: Yidi Wu <yidi@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911
Approved by: https://github.com/ydwu4
Summary:
Add three top level APIs for numeric debugger in pt2e flow that can log intermediate output in the model
and calculate summary for metric comparisons between nodes in two graphs
* `prepare_for_propagation_comparison`
* `extract_results_from_loggers`
* `compare_results`
Test Plan:
python test/test_quantization.py -k test_prepare_for_propagation_comparison
python test/test_quantization.py -k test_extract_results_from_loggers
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130643
Approved by: https://github.com/dulinriley, https://github.com/tarun292
We currently can't generate split scans when there are multiple scan
values, so we normally fall back to ATen. However, for the higher order
scan op, we can't fall back, so it makes sense to just generate the slower
kernel anyway. This avoids having special shapes where we fail to
codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130936
Approved by: https://github.com/lezcano
Sets `prefer_deferred_runtime_asserts_over_guards=True` for export, so any guards emitted from `SymNode.expect_true` (for example, guards that are implicitly required to be true for an op to succeed) won't lead to constraint violations. Instead these should appear in the graph as runtime asserts, or potentially as replacement expressions for placeholder shapes.
For example, this reshape op should emit s0 * s1 = s2, deferred as a runtime assert.
```
x = torch.randn(4, 8) # [s0, s1]
y = torch.randn(32) # [s2]
out = x.reshape(-1) + y
# this emits Eq(s0 * s1, s2), and we represent y's shape as [s0*s1] in the graph.
```
However, other complex guards can still cause export to fail, for instance guards emitted from `SymNode.guard_bool/guard_size_oblivious` (e.g. explicit if-else conditions in user code or lower-level op implementations hit during tracing) can still raise constraint violations. These can be deferred with `allow_complex_guards_as_runtime_asserts=True`. We don't yet make this default, because while this makes export more likely to succeed, it results in non-trivial asserts being emitted that often represent specialization to a variant of the op, or checks related to 0/1 specialization.
We also remove forced specializations for export and kill the `_disable_forced_specializations` flag; now, any guard we can't express with Dims/DerivedDims is either handled with hybrid SymInts or should be resolved by rewriting or deferring.
Follow up:
Currently, `ShapeEnv._set_replacement()` is called for complex equality expressions (e.g. s2 -> s0*s1 in the example above), and the ExportedProgram stores `s0*s1` in the input placeholder. This isn't checked for validity when the program is run, so an option is to avoid replacement and/or runtime assert on equality.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130775
Approved by: https://github.com/avikchaudhuri
The #130912 error happens since `operator.mul` does not have `_schema`.
So why do we have `operator.mul` and why is it not dispatched to `torch.ops.aten.mul`? This op comes from %mul_3.
%mul_3 : [num_users=50] = call_function[target=operator.mul](args = (%arg689_1, 4096), kwargs = {})
`%arg689_1` is a placeholder with `meta['val'] = s0`. It comes from dynamic shapes and represents the batch size, since it's also used in many other nodes such as:
%view_1 : [num_users=1] = call_function[target=torch.ops.aten.view.default](args = (%mm, [%arg689_1, 4096, 320]), kwargs = {})
and
%native_group_norm_2 : [num_users=1] = call_function[target=torch.ops.aten.native_group_norm.default](args = (%div_1, %arg16_1, %arg17_1, %arg689_1, 320, 4096, 32, 1e-06), kwargs = {})
To fix the issue, we can add `operator.mul` to the skip list.
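A minimal sketch of the distinction being exploited here; the `SKIP_TARGETS` name and helper are illustrative assumptions, not the pass's actual code:
```python
import operator
import torch

SKIP_TARGETS = {operator.mul}  # plain Python callables without an ATen schema

def has_schema(target):
    return target not in SKIP_TARGETS and hasattr(target, "_schema")

print(has_schema(operator.mul))               # False -> the node is skipped
print(has_schema(torch.ops.aten.mul.Tensor))  # True  -> schema-based handling applies
```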
Fixes #130912
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130986
Approved by: https://github.com/eellison
Fixes #128745
Solves the conflict that occurs when users use full_state_dict while the model is FSDP.
Currently solves the issue for `full_state_dict=True`, which fails with the error
`'aten.copy_.default: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!',).`
TODO: for `broadcast_from_rank0=True, full_state_dict=True`, the error is
`NotImplementedError: c10d::broadcast_: attempted to run this operator with Meta tensors`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129635
Approved by: https://github.com/fegin
Summary: Finishing up the mechanism to "register" certain types of operators to a registry so that the serializer can handle them correctly. This is expected to be used first by executorch.
Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_export_with_extension_op_serialization
Differential Revision: D59825148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130851
Approved by: https://github.com/angelayi
- More conservative estimation of plannable inputs
- Consider constant_pad_nd as pointwise node in concat lowering
- Use aten.cat instead of constant_pad_nd when padding just a single dimension, because it can be memory-planned away
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128909
Approved by: https://github.com/Chillee
**Summary**
Currently, the output of CommDebugMode contains a lot of noise from operations that usually won't tell the user much, such as aten.detach.default. I have created a set of these trivial operations and added a noise_level argument so that users can choose how much information they want to receive.
noise_level = 1 prints module-level collective counts
noise_level = 2 prints operations not included in trivial operations and module information
noise_level = 3 prints all operations
In addition, I have removed the generate_module_tracing_table since noise_level = 1 essentially replaces it. Finally, I have updated the examples and test cases.
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump
3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing
4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing
5. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing
6. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130410
Approved by: https://github.com/XilunWu
Might fix #127660; need to test some more cases.
We update the reinplacing pass. If we have something like the following,
where "sin" is a custom op (this situation should also apply to triton
kernels)
```py
def graph(x):
    y = sin(x)
    z = sin(y)
    x.copy_(z)
```
then the reinplacer used to produce the following:
```py
"""step 1: reinplaces the first sin"""
def graph(x):
x_clone = x.clone()
sin_out(x, out=x_clone)
z = sin(x_clone)
x.copy_(z)
"""step 2: reinplaces the second sin"""
def graph(x):
x_clone = x.clone()
sin_out(x, out=x_clone)
sin_out(x_clone, out=x_clone)
x.copy_(x_clone)
```
However, the first clone is unnecessary. It is safe to reinplace
the first sin into the following:
```py
def graph(x):
    sin_out(x, out=x)
    z = sin(x)
    x.copy_(z)
```
because there are no users of `x`'s original value (the copy_ node
doesn't actually use the original value of x!)
This PR updates the reinplacing pass to ignore copy_ in its computation of whether the original value of the mutated argument is still needed.
NB: this also applies to triton kernels, but it was easier for me to
reason about custom ops (and my repros were all for custom ops).
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130866
Approved by: https://github.com/oulgen
# Summary
- This removes a bunch of example score mods that were primarily used for testing and places them directly in the test file. We should follow up with merging test_flex_decode and test_flash when the velocity slows down a little
- Fixes a bug with indexing on block mask
- Adds some doc strings to helper funcs and fixes some misc typing things
- Forces functions passed to `create_block_mask` to mask_mods and updates tests files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130871
Approved by: https://github.com/joydddd, https://github.com/Chillee
This PR re-implements pin memory aiming to get rid of the optional `device` argument and makes all related APIs to be device-agnostic. We add two new abstract APIs in [AcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/detail/AcceleratorHooksInterface.h#L12) and redefine pin memory as: "Pin memory is always pinned for the current accelerator device". In detail, it uses [getAcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/Context.h#L61) in pin_memory/is_pinned to get an appropriate device and invoke the corresponding overridden interfaces, instead of using BackendSelect and then dispatching to CUDA or other specific backends' implement methods.
Note: For new backends who want to implement and use pin memory, just inherit AcceleratorHooksInterface and overwrite the `isPinnedPtr` and `getPinnedMemoryAllocator` methods.
Additional context: To avoid BC-breaking, this PR preserves the `device` arg of the related APIs and throws a deprecation warning if the `device` arg is passed. Another PR will be submitted, based on this one, to update all PT callers (`Tensor.is_pinned()`, `Tensor.pin_memory()`, ...) not to pass this arg. In the future, the `device` arg will actually be removed.
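A minimal sketch of the device-agnostic call pattern described above, guarded so it only runs when an accelerator is present:
```python
import torch

if torch.cuda.is_available():
    t = torch.randn(4).pin_memory()  # pinned for the current accelerator, no `device=` needed
    print(t.is_pinned())             # True; passing `device=` now emits a deprecation warning
```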
Relates #124908
Relates #14560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126376
Approved by: https://github.com/albanD
We should be able to create multiple CUDAPluggableAllocators in the same PyTorch program (see https://github.com/pytorch/pytorch/issues/124807, https://github.com/pytorch/pytorch/pull/125722 for context). When mixing CUDAPluggableAllocators in the same PyTorch program, we need to make sure that the deleter passed in through the CUDAPluggableAllocator gets "attached" to the data_ptr and persists until program exit (when it's called to free the memory).
Currently, CUDAPluggableAllocator maintains a global `current_custom_allocator`. When creating the `DataPtr`, `raw_deleter` attaches `custom_raw_deleter` to the DataPtr, which calls `current_custom_allocator->raw_delete(...)`. This approach is fine when using only one allocator; however, for the multiple-allocator use case, the DataPtr would be using the deleter of whatever is in `current_custom_allocator`. For example, if allocation 1 was done with `cudaMalloc` and allocation 2 was done with `ncclMemAlloc`, and `current_custom_allocator` is currently pointing to the CUDAPluggableAllocator with `ncclMemAlloc`, then when cleaning up allocation 1 we'd be using `ncclMemFree` instead of `cudaFree`.
In this PR, we solve the above problem by remembering the `free_fn_` using a deleter context. Hence, there is no need to go through an allocator object to find the deleter.
CC: @zdevito @ptrblck @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130472
Approved by: https://github.com/eqy, https://github.com/ezyang
Summary: Modify the existing `sum` operator in PyTorch, invoked by `torch.sum`, to allow for reductions along the ragged dimension of a nested tensor. This diff enables PyTorch users to invoke `torch.sum` on a nested tensor with `dim=1`, where `ragged_idx=1`.
Functions modified in `caffe2/torch/nested/_internal/ops.py`:
- `sum_dim_IntList()`: The function assumes that `ragged_idx=1`; in the case that `dim=1` as well, where `dim` is the dimension on which we reduce, this diff invokes the PyTorch benchmark found in D58423489. Specifically, this diff pads a nested tensor, e.g. of logical shape `(B, *, M)`, using [`torch.ops.aten._jagged_to_padded_dense_forward`](https://www.internalfb.com/code/fbsource/[92c2a067ab04e3eebc999254fed4ae2fbea6def3]/fbcode/deeplearning/fbgemm/fbgemm_gpu/fb/inductor_lowerings/elementwise_ops.py?lines=26), then reduces across the `*` dimension (`dim == 1`) to a `(B, M)` output tensor.
- `_wrap_jagged_dims()`: This diff adds special handling to allow for the case where `dim` contains `1` and not `0`, but to continue disallowing the case where `dim` contains `0` and not `1`. In this function's creation, I created a helper function, `_get_condition_for_invalid_jagged_reductions()`, which makes it clearer which conditions apply to which operators. Specifically, operators which are enabled with jagged reductions are specified at the top of the file in `SUPPORTED_JAGGED_REDUCTIONS` and have a different set of conditions that need to be tested, as reducing along `dim == 1` without `dim == 0` is now possible.
Functions modified in `caffe2/test/test_nestedtensor.py`:
- `test_sum_int_DimList()`: This diff adds special handling in the `sum` unit test to allow for the case where `dim` contains `1` and not `0`, but to continue disallowing the case where `dim` contains `0` and not `1`.
- `test_sum_int_DimList_ragged_dim_1()`: This diff adds a new unit test which verifies the accuracy and feasibility of reducing along the jagged dimension of a nested tensor.
Notes:
- This diff solely adds functionality for the case in which we reduce only along the ragged dimension. Cases in which we reduce along both the ragged and another dimension, like `dim == (1, 2)`, are not permitted, as this set of diffs focuses primarily on the former.
- The `sum` operator is the only operator which uses the function `_wrap_jagged_dims()`; all other operators use `_wrap_jagged_dim()`. I would like to later look into why this is the case and if we can consolidate this!
- I modified some of the comments in the `sum` function as well as the unit tests for more clarity.
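A minimal sketch of the newly enabled reduction, with illustrative shapes (not the test's actual code):
```python
import torch

# A jagged nested tensor of logical shape (B=2, *, M=4); dim=1 is the ragged dim.
nt = torch.nested.nested_tensor(
    [torch.randn(2, 4), torch.randn(3, 4)], layout=torch.jagged
)
out = nt.sum(dim=1)   # reduce over the ragged dimension
print(out.shape)      # torch.Size([2, 4]) -- a dense (B, M) tensor
```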
Test Plan:
Verify that existing (`test_sum_int_DimList`) and new (`test_sum_int_DimList_ragged_dim_1`) unit tests pass via the following command:
```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_sum_int_DimList
```
Differential Revision: D59571209
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130425
Approved by: https://github.com/davidberard98
Reland of: https://github.com/pytorch/pytorch/pull/128016
Summary from previous PR:
We assume only two possible mutually exclusive scenarios:
1. Running the compiled region for training (any of the inputs has requires_grad): produced differentiable outputs should have requires_grad.
2. Running the compiled region for inference (none of the inputs has requires_grad): all outputs do not have requires_grad.
Even if the user runs the region under no_grad() but has an input Tensor with requires_grad, we go with training scenario (1).
With the current state, that means:
1. needs_autograd should not check torch.is_grad_enabled(), only whether any of the inputs requires grad
2. if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under `torch.enable_grad()`
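A minimal sketch of rule 1 above; `needs_autograd` here is a standalone illustration, not AOTAutograd's actual helper:
```python
import torch

def needs_autograd(inputs):
    # Only input requires_grad matters; torch.is_grad_enabled() is deliberately not checked.
    return any(isinstance(t, torch.Tensor) and t.requires_grad for t in inputs)

x = torch.randn(2, requires_grad=True)
with torch.no_grad():
    print(needs_autograd((x,)))  # True -> the joint (training) graph is traced
```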
Why changes in the partitioner?
Inference and training graphs differed in their return container (list vs. tuple).
The changes in the partitioner unify this to always return a tuple.
As a result, there are some changes in test_aotdispatch.py for graph contents (list -> tuple).
Why was it reverted?
There was a regression of hf_Reformer model on inference.
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```
Because one of the compiled graphs contained outputs that are aliases of inputs that are nn.Parameter(requires_grad=True).
Even though the torchbench inference benchmark runs inside `torch.no_grad()`, alias ops (specifically for hf_Reformer, expand) preserve requires_grad.
As a result, we started compiling a training graph instead of an inference graph.
Fix for view ops:
If we have outputs that are aliases of inputs that require grad, those outputs requiring grad is not a reason to generate a training graph.
This is handled in aot_autograd.py, where output_and_mutation_safe is calculated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh
Summary:
This diff introduces a much more flexible model for WaitCounter backend:
1. Backend can be installed dynamically (even if not linked with pytorch) instead of relying on macros and swapping implementation at compile time
2. Multiple backends are supported at the same time.
Test Plan: unit test
Reviewed By: jamesperng
Differential Revision: D59795863
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130934
Approved by: https://github.com/asiab4
Summary: `test/distributed/_composable/test_replicate_with_compiler.py` exercises inductor. This change introduces a version of MultiProcessTestCase that derives from the inductor TestCase class to make sure we always get a clean cache dir.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494
Approved by: https://github.com/eellison
This PR resolves several sets of `_scaled_mm` test failures:
- `scale_a` and `scale_b` are now required arguments, so the function `sample_inputs_scaled_mm` must supply them
- `_scaled_mm` does not support `"meta"` device, so it should be skipped in `test_meta.py`
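A minimal sketch of the calling convention with the now-required scales, guarded to GPUs with fp8 support; this is illustrative, not the OpInfo's actual sample inputs:
```python
import torch

if torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9):
    a = torch.randn(16, 16, device="cuda").to(torch.float8_e4m3fn)
    b = torch.randn(16, 16, device="cuda").to(torch.float8_e4m3fn).t()  # second operand column-major
    scale_a = torch.tensor(1.0, device="cuda")
    scale_b = torch.tensor(1.0, device="cuda")
    out = torch._scaled_mm(a, b, scale_a, scale_b, out_dtype=torch.bfloat16)
    print(out.shape)  # torch.Size([16, 16])
```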
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130897
Approved by: https://github.com/drisspg
Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](17b45e905a/torch/fx/graph_module.py (L824))), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module.
Example print from `python test/export/test_unflatten.py -k test_unflatten_nested`
```
class UnflattenedModule(torch.nn.Module):
    def forward(self, x: "f32[2, 3]"):
        # No stacktrace found for following nodes
        rootparam: "f32[2, 3]" = self.rootparam
        # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam
        mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam); x = rootparam = None
        # No stacktrace found for following nodes
        foo: "f32[2, 3]" = self.foo(mul); mul = None
        bar: "f32[2, 3]" = self.bar(foo); foo = None
        return (bar,)

    class foo(torch.nn.Module):
        def forward(self, mul: "f32[2, 3]"):
            # No stacktrace found for following nodes
            child1param: "f32[2, 3]" = self.child1param
            nested: "f32[2, 3]" = self.nested(mul); mul = None
            # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param
            add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param); nested = child1param = None
            return add

        class nested(torch.nn.Module):
            def forward(self, mul: "f32[2, 3]"):
                # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x
                div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul); mul = None
                return div

    class bar(torch.nn.Module):
        def forward(self, add: "f32[2, 3]"):
            # No stacktrace found for following nodes
            child2buffer: "f32[2, 3]" = self.child2buffer
            # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer
            sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer); add = child2buffer = None
            return sub
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128617
Approved by: https://github.com/zhxchen17, https://github.com/pianpwk
Previously, it was only possible to either collect data or use a heuristic globally, regardless of where AutoHeuristic is used. This PR makes it possible to collect data for some optimizations while using a learned heuristic for other optimizations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130245
Approved by: https://github.com/shunting314
FSDP2 eager pre-allocates the output buffer for AllGather and the AllGather just writes into that buffer. However, under compile, by default we use out-of-place AllGather, which means in Traceable FSDP2 case we will be unnecessarily using more memory than eager. We want to re-inplace that AllGather instead.
This PR adds a post_grad pass to re-inplace all_gather_into_tensor (i.e. changing it from `all_gather_into_tensor.default` out-of-place op to `all_gather_into_tensor_out.default` out-variant op).
One thing to note is that since with this pass we are introducing a mutable op into the post_grad FX graph, we must do this pass after `reinplace_inplaceable_ops` (at which point we are okay again with having mutable ops in the graph). To facilitate this, this PR adds a `post_grad_custom_post_reinplace_pass` extension point to allow user-defined post-reinplace FX passes.
---
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_fullgraph_backend_inductor`
---
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129773
Approved by: https://github.com/eellison
This PR allows `fully_shard`'s first argument to be `List[nn.Module]` instead of strictly `nn.Module`. This allows more flexible grouping of modules/parameters for communication, which can lead to memory savings and/or more efficient communication.
**Approach**
At a high level, we can think of a model as a tree of modules. Previously, we could only select specific module nodes in this tree as representing one FSDP parameter group. With this PR, we can select a group of module nodes, effectively becoming a single super node.
To implement the runtime schedule, we define new forward hooks that run based on the following semantics:
- If a module is the first to run the pre-hook, actually run the given pre-hook. Otherwise, the pre-hook is no-op.
- If a module is the last to run the post-hook, actually run the given post-hook. Otherwise, the post-hook is a no-op.
- First and last are determined by scoreboarding against a set of the modules.
- This set must get cleared at the end of backward in the case that >=1 module in the list is never used, in which case we still want the forward hooks to run in the next forward after this backward.
Beyond these new forward hooks, everything else is some simple generalization from `Module` to `List[Module]` or `Tuple[Module, ...]`.
**Examples**
This PR enables wrapping Llama models more efficiently by grouping the final norm and output linear together: https://github.com/pytorch/torchtitan/pull/382.
If at least one of the modules in the list does not run forward before backward, then there will be a warning message like:
```
1 of the 2 modules passed to fully_shard did not run forward before backward, which is error-prone since FSDP post-forward/pre-backward logic will not run for these modules. We recommend passing only modules that run forward together. Modules that did not run forward: [FSDPLinear(in_features=1, out_features=1, bias=True)]
```
---
**Changes for reland:** none since breakage was from PR below
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130949
Approved by: https://github.com/weifengpy
ghstack dependencies: #130947
This PR relaxes `@contract` to allow the 1st argument to be `Sequence[nn.Module]` instead of strictly `nn.Module`. This is required for the next PR, which allows `fully_shard` to take in `List[nn.Module]`.
---
**Changes for reland:**
- The previous PR assumed that any `func` decorated with `@contract` would return the same input `module` as output (which is true for PT-D composable APIs).
- However, TorchRec `shard` returns a different module as output (though that module _does_ satisfy the `@contract` FQN check).
- This PR removes the assumption and instead only enforces the FQN check following the input module order. In other words, if calling `func([x1, ..., xN])` for `N` modules `x1, ..., xN` that returns `[y1, ..., yM]` for `M` modules, we require that `N = M` and that FQNs are preserved coordinate-wise: `xi` and `yi` have same FQNs for all `i = 1, ..., N`.
Differential Revision: [D59863438](https://our.internmc.facebook.com/intern/diff/D59863438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130947
Approved by: https://github.com/weifengpy, https://github.com/atalman
Fixes#127666.
Other std math functions are replaced with those in the global namespace during hipify. HIP does not claim to support every function in the C++ standard library. std::clamp is not yet supported, and we have been relying on the std implementation. For Fedora 40 + gcc 14, a host-side assert is used, which is not supported; work around this by replacing std::clamp with min and max. Using #ifndef USE_ROCM to differentiate between CUDA (std::clamp) and the ROCm replacement broke Windows builds. Since the replacement generates the same PTX as std::clamp, the replacement is used unconditionally. See https://godbolt.org/z/Wde9KW3v4 for a sample.
Original patch comes from @lamikr. Modified to improve efficiency.
https://github.com/lamikr/rocm_sdk_builder/pull/37
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127812
Approved by: https://github.com/hongxiayang, https://github.com/malfet
Summary: Uses original ExportedProgram constants and graph signature to inform decompositions, so that constant tensors and non-persistent buffers are respected for training IR. Removes 7 test failures for training IR.
Test Plan: test_export
Differential Revision: D59820909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130864
Approved by: https://github.com/angelayi
Summary: `collect_defined_kernels()` is essentially patching deep inside to see if a specific codegen is happening. We could also patch somewhere in the cache path to make sure it's called, but I'm not sure that's really testing anything interesting. I suggest it's better to just disable the remote cache here.
Test Plan: `buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:metrics -- --exact 'caffe2/test/inductor:metrics - test_kernel_args_num_gb (caffe2.test.inductor.test_metrics.TestMetrics)' --run-disabled --stress-runs 10`
Differential Revision: D59825899
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130853
Approved by: https://github.com/oulgen
In this PR, I added support for packaging the AOTI generated files into a zipfile, and loading it in python.
`compile_so` takes the path to the package, a device, and a desired so_path location; it compiles the package into a .so and saves it to the specified location.
`load_package` takes a path to the package and a device, calls _extract_so, and then creates a callable to run the compiled model.
The zipfile generated looks like the following:
```
|- version
|- archive_format
|- data
    |- aotinductor
        |- cbtnafqaqrhvwztv7xudlal4xs6sofxa5oxccyuaqtrt6aozaklx.cubin # AOTI cuda generated cubin files
        |- cskkqtna23bty2v3aq7g2q37cxrgufehlkuaaolhlgug5zg6fuwe.cpp # AOTI generated cpp file
        |- cskkqtna23bty2v3aq7g2q37cxrgufehlkuaaolhlgug5zg6fuwe_compile_flags # Flags for compiling the .o
        |- c6qqtnpgwfi3dv5nb76ai773kt45ezoxfwdmd7q37lvq6fs2tnoi.o # AOTI saved const.o
        |- cskkqtna23bty2v3aq7g2q37cxrgufehlkuaaolhlgug5zg6fuwe_linker_flags # Flags for linking the files to form the .so
    |- constants
        |- constants.pt # Constants saved using torch.save, can be loaded using mmap
```
The workflow is something like:
```
with torch.no_grad():
    ep = torch.export.export(
        model,
        example_inputs,
        dynamic_shapes=dynamic_shapes,
        strict=False,
    )
    gm = ep.module()
    package_path = torch._inductor.aot_compile(
        gm,
        example_inputs,
        options={
            "aot_inductor.output_path": "my_path.pt2",  # or a directory
            "aot_inductor.package": True,
        },
    )
    compiled_model = torch._inductor.package.load_package(package_path, device)
    return compiled_model
```
I tried turning on loading the weights using mmap by default, but had some trouble with it, so that is just left as a todo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129895
Approved by: https://github.com/malfet
Summary:
This diff does a minor cleanup of WaitCounters:
1. Fixes some singleton use to ensure one instance of WaitCounterImpl per counter per process
2. Updates API to enable measuring duration of individual wait operations
Test Plan: unit test
Differential Revision: D59709324
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130664
Approved by: https://github.com/c-p-i-o, https://github.com/asiab4
Enables a few extra ruff rules, most of which do not have any violations since I already cleaned them up in earlier PRs; this just turns them on to enforce them. Adds 1 noqa, as we want the suboptimal lambda generation + call kept as a test. Also enables the test in flake8.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130700
Approved by: https://github.com/justinchuby, https://github.com/ezyang
Adds better error messages when a socket fails to bind in libuv.
New format:
```
The server socket has failed to bind. port: 1, useIpv6: 0, code: -13, name: EACCES, message: permission denied
```
Old format:
```
The server socket has failed to listen on any local network address. useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use
```
Test plan:
Added test in `test_store.py`
```
python test/distributed/test_store.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130797
Approved by: https://github.com/kurman
This is useful for splitting grad to run in two parts while preserving intermediates:
<details>
<summary>
Click to see code
</summary>
```python
import collections
import weakref
from typing import Deque

import torch
from torch.autograd.graph import GradientEdge

def _get_grad_fn_or_grad_acc(t):
    if t.requires_grad and t.grad_fn is None:
        return t.view_as(t).grad_fn.next_functions[0][0]
    else:
        return t.grad_fn

def reverse_closure(roots, target_nodes):
    # Recurse until we reach a target node
    closure = set()
    actual_target_nodes = set()
    q: Deque = collections.deque()
    for node in roots:
        if node is not None and node not in closure:
            closure.add(node)
            q.append(node)
    while q:
        node = q.popleft()
        reverse_edges = node.metadata.get("reverse_edges", [])
        for holder_ref, idx in reverse_edges:
            ref = holder_ref()
            if ref is None:
                # the Holder was collected, so the reverse graph is gone
                raise RuntimeError("Reverse graph is no longer alive")
            fn = ref.node
            if fn in closure or fn is None:
                continue
            if fn in target_nodes:
                actual_target_nodes.add(fn)
                continue
            closure.add(fn)
            q.append(fn)
    return closure, actual_target_nodes

# Enable weak pointer
class Holder():
    def __init__(self, node):
        self.node = node

# TODO: use weak references to avoid reference cycle
def construct_reverse_graph(roots):
    q: Deque = collections.deque()
    root_seen = set()
    reverse_graph_refs = []
    for node in roots:
        if node is not None and node not in root_seen:
            q.append(node)
            root_seen.add(node)
    while q:
        node = q.popleft()
        for fn, idx in node.next_functions:
            if fn is not None:
                # Don't necessarily need to store on the graph
                reverse_edges = fn.metadata.get("reverse_edges", [])
                if len(reverse_edges) == 0:
                    q.append(fn)
                holder = Holder(node)
                holder_ref = weakref.ref(holder)
                reverse_graph_refs.append(holder)
                reverse_edges.append((holder_ref, idx))
                fn.metadata["reverse_edges"] = reverse_edges
    return reverse_graph_refs

def get_param_groups(inputs, params):
    inputs_closure, _ = reverse_closure(inputs, set())
    param_groups = dict()  # keyed on intermediates
    for i, param in enumerate(params):
        closure, intersected = reverse_closure([param], inputs_closure)
        param_group = {
            "params": set([param]),
            "intermediates": set(intersected),
        }
        for input_node in intersected:
            existing = param_groups.get(input_node, None)
            if existing is not None:
                existing["params"] = existing["params"].union(param_group["params"])
                existing["intermediates"] = existing["intermediates"].union(param_group["intermediates"])
                param_group = existing
            else:
                param_groups[input_node] = param_group
    # Sanity check: union of all param_groups params should be equal to all params
    union_params = set()
    seen_ids = set()
    unique_param_groups = []
    for param_group in param_groups.values():
        if id(param_group) not in seen_ids:
            seen_ids.add(id(param_group))
            unique_param_groups.append(param_group)
            union_params = union_params.union(param_group["params"])
    assert union_params == set(params)
    return unique_param_groups

def compute_grads_only_inputs2(roots, inps, weights):
    root_grad_fns = list(map(_get_grad_fn_or_grad_acc, roots))
    inp_grad_fns = list(map(_get_grad_fn_or_grad_acc, inps))
    weight_grad_fns = list(map(_get_grad_fn_or_grad_acc, weights))
    reverse_graph_refs = construct_reverse_graph(root_grad_fns)
    param_groups = get_param_groups(inp_grad_fns, weight_grad_fns)
    del reverse_graph_refs
    for param_group in param_groups:
        for i, intermediate in enumerate(param_group["intermediates"]):
            def get_hook(param_group, i):
                def hook(grad_inputs):
                    if param_group.get("grads", None) is None:
                        param_group["grads"] = [None] * len(param_group["intermediates"])
                    param_group["grads"][i] = grad_inputs
                return hook
            # These are always "split" nodes that we need to recompute, so
            # save their inputs.
            intermediate.register_prehook(get_hook(param_group, i))
    dinputs = torch.autograd.grad((out,), inputs=tuple(inps), grad_outputs=(torch.ones_like(out),), retain_graph=True)
    return dinputs, param_groups

def compute_grads_only_weights2(user_weights, param_groups):
    all_dweights = dict()
    for param_group in param_groups:
        # TODO: Handle case where intermediate can have multiple outputs
        intermediate_edges = tuple(GradientEdge(i, 0) for i in param_group["intermediates"])
        weights_edges = tuple(GradientEdge(w, 0) for w in param_group["params"])
        assert all(len(g) == 1 for g in param_group["grads"])
        # [NEW!] Able to pass a GradientEdge to autograd.grad as output
        # We do not need to retain_graph because... guarantee no overlap?
        print("trying to execute: ", intermediate_edges, weights_edges)
        dweights = torch.autograd.grad(intermediate_edges, weights_edges, grad_outputs=sum(param_group["grads"], tuple()))
        for w, dw in zip(param_group["params"], dweights):
            all_dweights[w] = dw
    # return grads in the original order weights were provided in
    out = []
    for w in user_weights:
        grad_acc = _get_grad_fn_or_grad_acc(w)
        out.append(all_dweights[grad_acc])
    return tuple(out)
```
</details>
```python
import torch.nn as nn

# Setup
mod1 = nn.Linear(10, 10)
mod2 = nn.Linear(10, 10)
a = torch.rand(10, requires_grad=True)
weights = tuple(mod1.parameters()) + tuple(mod2.parameters())
inps = (a,)
out = mod2(mod1(a))

class LoggingTensorMode(torch.utils._python_dispatch.TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        if kwargs is None:
            kwargs = {}
        rs = func(*args, **kwargs)
        print(f"{func.__module__}.{func.__name__}")
        return rs

print(" -- SPLIT -- ")
# Compute gradients in two parts
with LoggingTensorMode():
    print("PART 1")
    dinputs, state = compute_grads_only_inputs2((out,), inps, weights)
    print("PART 2")
    dweights = compute_grads_only_weights2(weights, state)

out = mod2(mod1(a))
print(" -- REF -- ")
# Compare with reference
with LoggingTensorMode():
    ref_all_gradients = torch.autograd.grad(out, inputs=tuple(inps) + weights, grad_outputs=(torch.ones_like(out),))

for actual, ref in zip(dinputs + dweights, ref_all_gradients):
    print(torch.allclose(actual, ref))
```
<img width="598" alt="image" src="https://github.com/pytorch/pytorch/assets/13428986/3681b8a7-3ab4-4d1d-a836-abef6913e671">
```
PART 1
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.ones_like.default
V0603 10:17:21.590878 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x12a1ee160> with grad_outputs: [f32[10]]
torch._ops.aten.view.default
V0603 10:17:21.591204 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1ee0d0> with grad_outputs: [f32[1, 10]]
torch._ops.aten.t.default
torch._ops.aten.mm.default
V0603 10:17:21.591578 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x100d7ae50> with grad_outputs: [f32[1, 10]]
torch._ops.aten.view.default
V0603 10:17:21.591747 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x12a1e4a60> with grad_outputs: [f32[10]]
torch._ops.aten.view.default
V0603 10:17:21.591834 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1e4bb0> with grad_outputs: [f32[1, 10]]
torch._ops.aten.t.default
torch._ops.aten.mm.default
V0603 10:17:21.591922 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x12a1e4a90> with grad_outputs: [f32[1, 10]]
torch._ops.aten.view.default
PART 2
trying to execute: (GradientEdge(node=<AddmmBackward0 object at 0x12a1e4bb0>, output_nr=0),) (GradientEdge(node=<AccumulateGrad object at 0x12a21b130>, output_nr=0), GradientEdge(node=<AccumulateGrad object at 0x12a21b7c0>, output_nr=0))
V0603 10:17:21.592223 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1e4bb0> with grad_outputs: [f32[1, 10]]
torch._ops.aten.t.default
torch._ops.aten.mm.default
torch._ops.aten.t.default
torch._ops.aten.sum.dim_IntList
torch._ops.aten.view.default
V0603 10:17:21.592421 8300067520 torch/autograd/graph.py:751] Executing: <TBackward0 object at 0x12a1cad60> with grad_outputs: [f32[10, 10]]
torch._ops.aten.t.default
trying to execute: (GradientEdge(node=<AddmmBackward0 object at 0x12a1ee0d0>, output_nr=0),) (GradientEdge(node=<AccumulateGrad object at 0x12a1e41c0>, output_nr=0), GradientEdge(node=<AccumulateGrad object at 0x12a21b670>, output_nr=0))
V0603 10:17:21.593481 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1ee0d0> with grad_outputs: [f32[1, 10]]
torch._ops.aten.t.default
torch._ops.aten.mm.default
torch._ops.aten.t.default
torch._ops.aten.sum.dim_IntList
torch._ops.aten.view.default
V0603 10:17:21.593750 8300067520 torch/autograd/graph.py:751] Executing: <TBackward0 object at 0x12a21b2b0> with grad_outputs: [f32[10, 10]]
torch._ops.aten.t.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127766
Approved by: https://github.com/albanD
Summary:
We should log compile ID as well for easier comparison.
Currently going through some of this data, I think we should make a few more changes as well.
Reland for D59725870
Test Plan: Sandcastle and Pytorch
Differential Revision: D59789110
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130801
Approved by: https://github.com/oulgen
FSDP2 eager pre-allocates the output buffer for AllGather and the AllGather just writes into that buffer. However, under compile, by default we use out-of-place AllGather, which means in Traceable FSDP2 case we will be unnecessarily using more memory than eager. We want to re-inplace that AllGather instead.
This PR adds a post_grad pass to re-inplace all_gather_into_tensor (i.e. changing it from `all_gather_into_tensor.default` out-of-place op to `all_gather_into_tensor_out.default` out-variant op).
One thing to note is that since with this pass we are introducing a mutable op into the post_grad FX graph, we must do this pass after `reinplace_inplaceable_ops` (at which point we are okay again with having mutable ops in the graph). To facilitate this, this PR adds a `post_grad_custom_post_reinplace_pass` extension point to allow user-defined post-reinplace FX passes.
---
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_fullgraph_backend_inductor`
---
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129773
Approved by: https://github.com/eellison
Summary: Similar to the handling of metrics, save inductor counter deltas in the FX graph cache entry and increment the counters appropriately on a cache hit
Test Plan: new unit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130635
Approved by: https://github.com/eellison
beartype has served us well in identifying type errors and ensuring we call internal functions with the correct arguments (thanks!). However, the value of having beartype is diminished because of the following:
1. When beartype improved support for better Dict[] type checking, it discovered typing mistakes in some functions that were previously uncaught. This caused the exporter to fail with newer versions of beartype when it used to succeed. Since we cannot fix PyTorch and release a new version just because of this, it creates confusion for users of torch.onnx who happen to have beartype in their environment.
2. beartype adds an additional call line in the traceback, which makes the already thick dynamo stack even larger, affecting readability when users diagnose errors with the traceback.
3. Since the typing annotations need to be evaluated, we cannot use new syntaxes like `|` because we need to maintain compatibility with Python 3.8. We don't want to wait for PyTorch to take py310 as the lowest supported Python version before using the new typing syntaxes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130484
Approved by: https://github.com/titaiwangms
Summary: Adds non-strict implementation of training IR export. Any expected non-strict training IR failures are also either existing strict training IR or non-strict failures (no new failures added). 4 strict training IR failures also resolved.
Refraining from unifying export/export_for_training, per @ydwu4's feedback :)
Test Plan: added test_export_training_ir_to_run_decomp_non_strict.py for non-strict training IR
Differential Revision: D59349454
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130062
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17
Summary:
Move the alloc_trace logic into a separate class, to reduce risk of deadlocks when mixing with CCA's lock. Switch to an std::mutex instead of std::recursive_mutex.
Lets us re-use the logic in the TraceEntryRingBuffer class in later diffs.
Test Plan: CI, resnet run, and FBR model.
Differential Revision: D59690408
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130741
Approved by: https://github.com/davidberard98
Uses `dict.fromkeys` whenever possible, as covered by flake8-comprehensions rule C420. While the ruff rule RUF025 is still in preview, flake8-comprehensions has added a new rule which covers this. Using dict.fromkeys is faster when the value being added to the dictionary is the same at every iteration and is immutable, and it also removes an unnecessary dict comprehension.
This rule will be enabled with our current ruleset in RUF in 0.6 as C420.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130699
Approved by: https://github.com/lezcano, https://github.com/ezyang
Summary:
By default, performance tests (speedup experiments) run the baseline and the test backend alternately.
However, this does not work for the torchao backend, which changes the model in place; therefore the baseline run would also run with the torchao backend since the model has already been quantized.
Add a new experiment "latency_experiment" to run performance tests non-alternately (first run baseline for a few iterations, then run the test backend).
Test Plan:
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16
```
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization autoquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
```
Differential Revision: D59332736
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130136
Approved by: https://github.com/jerryzh168
The conversion cache used for fixing https://github.com/pytorch/pytorch/issues/115260 depended on "store", which might be removed and ignored. This could lead to inconsistent code being generated between the vec and scalar kernels: we generate the scalar kernel first, followed by the vector kernel, and a store buffer removed by the scalar kernel then impacts the vector kernel codegen. This PR moves the caching from "store" to the "to_dtype" calls, which won't be impacted by the removed buffers.
`pytest -k test_consistent_remove_buffers test/inductor/test_cpu_repro.py`
before
```c++
extern "C" void kernel(const bfloat16* in_ptr0,
                       bfloat16* out_ptr1)
{
    {
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>(x0), 16);
            auto tmp1 = at::vec::convert<float>(tmp0);
            auto tmp2 = tmp1 + tmp1;
            auto tmp3 = at::vec::convert<bfloat16>(tmp2);
            auto tmp4 = at::vec::convert<float>(tmp3);
            auto tmp5 = tmp1 + tmp4;
            auto tmp6 = at::vec::convert<bfloat16>(tmp5);
            tmp6.store(out_ptr1 + static_cast<long>(x0), 16);
        }
        #pragma omp simd simdlen(8)
        for(long x0=static_cast<long>(64L); x0<static_cast<long>(65L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<long>(x0)];
            auto tmp1 = c10::convert<float>(tmp0);
            auto tmp2 = decltype(tmp1)(tmp1 + tmp1);
            auto tmp3 = c10::convert<bfloat16>(tmp2);
            auto tmp4 = decltype(tmp1)(tmp1 + tmp2);
            auto tmp5 = c10::convert<bfloat16>(tmp4);
            out_ptr1[static_cast<long>(x0)] = tmp5;
        }
    }
}
```
after
```c++
extern "C" void kernel(const bfloat16* in_ptr0,
                       bfloat16* out_ptr1)
{
    {
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>(x0), 16);
            auto tmp1 = at::vec::convert<float>(tmp0);
            auto tmp2 = tmp1 + tmp1;
            auto tmp3 = at::vec::convert<bfloat16>(tmp2);
            auto tmp4 = tmp1 + tmp2;
            auto tmp5 = at::vec::convert<bfloat16>(tmp4);
            tmp5.store(out_ptr1 + static_cast<long>(x0), 16);
        }
        #pragma omp simd simdlen(8)
        for(long x0=static_cast<long>(64L); x0<static_cast<long>(65L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<long>(x0)];
            auto tmp1 = c10::convert<float>(tmp0);
            auto tmp2 = decltype(tmp1)(tmp1 + tmp1);
            auto tmp3 = c10::convert<bfloat16>(tmp2);
            auto tmp4 = decltype(tmp1)(tmp1 + tmp2);
            auto tmp5 = c10::convert<bfloat16>(tmp4);
            out_ptr1[static_cast<long>(x0)] = tmp5;
        }
    }
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130677
Approved by: https://github.com/leslie-fang-intel
# Motivation
I found a difference between sympy 1.12 and 1.13.
```python
# for 1.12
>>> import sympy
>>> a = sympy.Number(0.0)
>>> a == 0
True
```
```python
# for 1.13
>>> import sympy
>>> a = sympy.Number(0.0)
>>> a == 0
False
```
The different behavior will impact the result of [safe_mul](6beec34b1c/torch/utils/_sympy/value_ranges.py (L521-L528)), resulting in an incorrect result when `a = sympy.Number(0.0)` and `b = inf`: the result is `nan` if the sympy version is 1.13 (the expected result is **0**).
```python
def safe_mul(a, b):
    # Make unknown() * wrap(0.0) == wrap(0.0)
    if a == 0.0:
        return a
    elif b == 0.0:
        return b
    else:
        return a * b
```
Across different sympy versions, `sympy.Number(0)` always has the same behavior: it equals 0.0.
```python
>>> import sympy
>>> a = sympy.Number(0)
>>> a == 0.0
True # for different sympy versions
```
So, use 0.0 when checking for zero in safe_mul to stay compatible with different sympy versions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130729
Approved by: https://github.com/lezcano, https://github.com/EikanWang
As titled, this fixes a case where, when `ord` is passed as 2 (the default value), the op
dispatching does not receive the default value.
We simply check whether the args schema receives an `ord` field or not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130753
Approved by: https://github.com/awgu
My attempt at a fix for https://github.com/pytorch/pytorch/issues/130335, see issue for more details / internal xref. Any feedback from inductor folks is appreciated. I attempted to make the move-constructors-to-cuda pass a bit less aggressive by detecting when the movement would incur a H2D sync for `aten.index_put_`. I'm not sure if there are any other ops that inductor falls back to eager on, that may-or-may-not incur a H2D sync if we change any of their inputs from cpu to cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130338
Approved by: https://github.com/eellison
This PR marks all buffers and parameters of an NNModule as static using the `mark_static_address` API. As a result, when tensors are passed to AOT, the `tensor_dict` metadata of placeholder nodes will contain the `static_address_type` key, indicating which graph argument positions are static for cudagraphs.
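For reference, a minimal standalone illustration of the `mark_static_address` API that this PR applies to every NNModule parameter and buffer:
```python
import torch

# Marking a tensor's address as static; this PR performs the same marking
# automatically for all parameters and buffers of an NNModule.
buf = torch.ones(8)
torch._dynamo.mark_static_address(buf)
```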
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130391
Approved by: https://github.com/anijain2305
Extend constant folding to dynamic-shape nodes; only pointwise ops and some restricted ops are supported.
We support dynamic shapes by limiting constant folding to ops that are guaranteed to have uniform values (full, pointwise ops, and views) and running these operators with tensors of shape 1. This also eliminates the possibility of memory overhead from constant folding.
Taken over from https://github.com/pytorch/pytorch/pull/128937
joint work with @imzhuhl
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129686
Approved by: https://github.com/Chillee
ghstack dependencies: #130367
This PR allows `fully_shard`'s first argument to be `List[nn.Module]` instead of strictly `nn.Module`. This allows more flexible grouping of modules/parameters for communication, which can lead to memory savings and/or more efficient communication.
**Approach**
At a high level, we can think of a model as a tree of modules. Previously, we could only select specific module nodes in this tree as representing one FSDP parameter group. With this PR, we can select a group of module nodes, effectively becoming a single super node.
To implement the runtime schedule, we define new forward hooks that run based on the following semantics:
- If a module is the first to run the pre-hook, actually run the given pre-hook. Otherwise, the pre-hook is no-op.
- If a module is the last to run the post-hook, actually run the given post-hook. Otherwise, the post-hook is a no-op.
- First and last are determined by scoreboarding against a set of the modules.
- This set must get cleared at the end of backward in the case that >=1 module in the list is never used, in which case we still want the forward hooks to run in the next forward after this backward.
Beyond these new forward hooks, everything else is some simple generalization from `Module` to `List[Module]` or `Tuple[Module, ...]`.
**Examples**
This PR enables wrapping Llama models more efficiently by grouping the final norm and output linear together: https://github.com/pytorch/torchtitan/pull/382.
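A minimal sketch of the new call form (assumes an initialized default process group, e.g. under torchrun; the module shapes here are made up):
```python
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard

norm = nn.LayerNorm(1024)
output = nn.Linear(1024, 4096)

# Previously each call could only take a single nn.Module:
#   fully_shard(norm); fully_shard(output)
# With this PR, the two modules form one FSDP parameter group, i.e. a single
# all-gather / reduce-scatter covers both of them.
fully_shard([norm, output])
```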
If at least one of the modules in the list does not run forward before backward, then there will be a warning message like:
```
1 of the 2 modules passed to fully_shard did not run forward before backward, which is error-prone since FSDP post-forward/pre-backward logic will not run for these modules. We recommend passing only modules that run forward together. Modules that did not run forward: [FSDPLinear(in_features=1, out_features=1, bias=True)]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127786
Approved by: https://github.com/yf225, https://github.com/weifengpy
ghstack dependencies: #127773
This PR relaxes `@contract` to allow the 1st argument to be `Sequence[nn.Module]` instead of strictly `nn.Module`. This is required for the next PR, which allows `fully_shard` to take in `List[nn.Module]`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127773
Approved by: https://github.com/weifengpy
This PR adds support to use expandable segments with private memory pools which should unblock using it with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks.
The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work.
Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together.
The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned back to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments are similar to those for normal blocks and do all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda.
With Cuda Graph Trees and private memory pools, we need the ability to take checkpoints of the current state of the memory allocator after each graph capture as well as reapplying the state before capturing a new graph after replaying a captured graph so that the new cuda graph capture has access to the state of the allocator at the point after replaying a previously captured graph so it can reuse empty blocks and allocate new ones.
As mentioned in a comment below, memory in a private pool is cached until the private pool is destroyed, and allocations can only grow from extra graph captures; any freeing of memory would result in invalid memory addresses and would break cuda graphs.
One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are tracked in a member variable `unmapped` of a `BlockPool`. `unmapped` is *not* part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints, since we never free/unmap memory back to cuda and it is persisted across graph captures / replays.
Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint.
Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068
Approved by: https://github.com/eqy, https://github.com/eellison
Summary: The scalar tensor by default is on CPU, which failed the cuda graph capture. To fix the issue, we put the scalar tensor on GPU
Test Plan: buck2 test 'fbcode//mode/opt' fbcode//gen_ai/llm_inference/fb/tests:test_llama2_multimodal_generator -- --exact 'gen_ai/llm_inference/fb/tests:test_llama2_multimodal_generator - gen_ai.llm_inference.fb.tests.test_llama2_multimodal_generator.TestGenerator: test_multimodal_decode_gen2'
Differential Revision: D59740639
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130712
Approved by: https://github.com/Skylion007, https://github.com/chenyang78
This PR introduces AutoHeuristic, a framework to collect results from autotuning, learn a heuristic as a machine learning model (a regression tree), and then ship the learned heuristic by generating the regression tree to code.
The heuristics have been learned on artificial/random data that has been collected with the `gen_data_pad_mm.py` script. The `gen_pad_mm_a100.sh` scripts can then be used to learn a heuristic and generate it to code.
The best model is decided by doing a grid search over various values for `max_depth` and `min_samples_leaf` and choosing the model with the highest number of correct predictions on the validation set.
The heuristic can return "unsure" which means that it is not sure which choice is the best choice and as a result autotuning will happen.
On A100 only tensors where each dimension is >= 512 are considered. For smaller tensors the heuristics that I learned returned "unsure" too often.
The results for randomly generated data and huggingface look as follows:
`max_wrong_speedup` is max(`wrong_speedups`) where `wrong_speedups` contains all the speedups one could have achieved for those examples where the heuristic made a wrong choice, i.e. a `max_wrong_speedup` of 1.37 means that the heuristic selected a choice, but the other choice would have been 1.37x faster. `gman_wrong_speedup` is the geomean of `wrong_speedups`.
The heuristic is learned as a regression tree, that returns higher values for better choices. The threshold decides how much better the better choice has to be for it to be returned, i.e. on A100 if the better choice is less than 1.702530x better than the other choice, "unsure" will be returned. This threshold is determined using the validation set.
A100
```
max_depth min_samples_leaf dataset correct wrong unsure total max_wrong_speedup gman_wrong_speedup threshold
15 5.0 10 train 2730 4 3023 5757 1.372220 1.193873 1.702530
16 5.0 10 val 878 0 1042 1920 NaN NaN 1.702530
17 5.0 10 test 925 2 993 1920 1.741708 1.354954 1.702530
18 5.0 10 hf-train 14 0 22 36 NaN NaN 1.702530
19 5.0 10 hf-inf 7 0 1 8 NaN NaN 1.702530
```
The numbers for huggingface only include tensors where each dim is >=512. If all tensors would have been included there would have been the following number of matmuls, where at least one dimension is unaligned:
A100 hf-train: 60
A100 hf-inf: 10
## Results on running huggingface locally
This only includes models where the learned heuristic made at least one decision. For the examples here, it takes around 0.25-0.3 seconds to perform autotuning for the padded and unpadded version, so each decision that the heuristic makes saves around 0.25-0.3 seconds.
#pad_mm_autotuning is the number of times autotuning happened in pad_mm and #heuristic_made_decision is the number of times the heuristic made a decision (i.e. it didn't return "unsure").
I ran huggingface locally, each model 5 times and took the median speedup and compilation_latency.
Results on huggingface training
```
name speedup_heuristic speedup_baseline speedup_diff compilation_latency_heuristic compilation_latency_baseline compilation_latency_diff comp_latency_reduction% #pad_mm_autotuning #heuristic_made_decision
BartForCausalLM 1.19 (+/- 0.00) 1.19 (+/- 0.00) -0.00 40.33 (+/- 1.13) 40.95 (+/- 0.78) -0.62 1.52 3 2
BartForConditionalGeneration 1.53 (+/- 0.06) 1.47 (+/- 0.05) 0.06 81.93 (+/- 5.20) 82.23 (+/- 1.92) -0.30 0.36 3 1
BlenderbotSmallForCausalLM 1.86 (+/- 0.04) 1.86 (+/- 0.00) 0.00 36.76 (+/- 0.49) 37.62 (+/- 1.33) -0.87 2.31 3 2
CamemBert 2.36 (+/- 0.01) 2.35 (+/- 0.01) 0.01 97.60 (+/- 1.91) 98.69 (+/- 1.35) -1.09 1.11 2 1
DistillGPT2 2.57 (+/- 0.01) 2.57 (+/- 0.01) 0.00 57.33 (+/- 0.77) 58.26 (+/- 1.41) -0.93 1.59 3 2
PLBartForCausalLM 2.07 (+/- 0.01) 2.06 (+/- 0.01) 0.01 32.54 (+/- 0.83) 34.65 (+/- 0.71) -2.11 6.10 3 2
PLBartForConditionalGeneration 1.87 (+/- 0.00) 1.88 (+/- 0.00) -0.01 58.45 (+/- 1.24) 58.95 (+/- 1.92) -0.50 0.85 3 1
RobertaForCausalLM 2.39 (+/- 0.01) 2.40 (+/- 0.01) -0.01 97.38 (+/- 1.52) 97.69 (+/- 1.18) -0.31 0.32 2 1
TrOCRForCausalLM 1.70 (+/- 0.00) 1.70 (+/- 0.00) -0.00 44.79 (+/- 1.33) 45.25 (+/- 1.08) -0.46 1.01 3 2
Mean difference in speedup: 0.01
Mean compilation latency saved: -0.80s
Mean compilation latency reduction: 1.68%
```
Results on huggingface inference
```
name speedup_heuristic speedup_baseline speedup_diff compilation_latency_heuristic compilation_latency_baseline compilation_latency_diff comp_latency_reduction% #pad_mm_autotuning #heuristic_made_decision
BartForCausalLM 1.11 (+/- 0.00) 1.11 (+/- 0.00) 0.00 19.02 (+/- 0.28) 19.40 (+/- 0.35) -0.38 1.95 3 2
BartForConditionalGeneration 1.26 (+/- 0.01) 1.23 (+/- 0.03) 0.03 36.84 (+/- 0.40) 36.55 (+/- 0.75) 0.30 -0.81 3 1
BlenderbotSmallForCausalLM 1.87 (+/- 0.02) 1.87 (+/- 0.01) 0.00 17.53 (+/- 0.31) 18.03 (+/- 0.43) -0.49 2.74 3 2
DistillGPT2 2.50 (+/- 0.02) 2.50 (+/- 0.01) 0.00 16.16 (+/- 0.29) 16.40 (+/- 0.18) -0.24 1.46 3 2
PLBartForCausalLM 1.93 (+/- 0.01) 1.94 (+/- 0.01) -0.00 15.30 (+/- 0.22) 16.01 (+/- 0.71) -0.71 4.43 3 2
PLBartForConditionalGeneration 1.98 (+/- 0.01) 1.98 (+/- 0.01) 0.00 25.90 (+/- 0.32) 26.58 (+/- 0.62) -0.67 2.53 3 1
TrOCRForCausalLM 1.61 (+/- 0.00) 1.62 (+/- 0.00) -0.01 21.38 (+/- 0.37) 21.85 (+/- 0.16) -0.47 2.16 3 2
Mean difference in speedup: 0.00
Mean compilation latency saved: -0.38s
Mean compilation latency reduction: 2.07%
```
For now, the heuristic can only be applied to decide whether to pad for mm. One could also learn heuristics for bmm and addmm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128643
Approved by: https://github.com/Chillee, https://github.com/eellison
Summary: Looks like "spawn" is broken. Since we have "subprocess", I don't think we need it anymore, so just remove it as an option.
Test Plan: Verified that we get: `AssertionError: Invalid start method: spawn`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130746
Approved by: https://github.com/Skylion007
#### Issue
Fix two issues related to inputs lifting when there are sub-blocks.
* Some inputs may appear in the nested sub-blocks, which need a recursive search to identify which arguments need to be lifted / passed in the top-level block.
* Some inputs to the sub-block are intermediate results, meaning their names are just numbers. This will cause issues during code generation (i.e., invalid argument names). We rename those to valid names.
#### Test Plan
* `pytest test/export/test_converter.py -s -k test_convert_nn_module_with_nested_if_and_param`
* `test/export/test_converter.py -s -k test_hidden_input_name`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128386
Approved by: https://github.com/angelayi
This is the implementation following the RFC: https://github.com/pytorch/pytorch/issues/130407
ncclCommSplit
Summary:
In the current PyTorch/c10d, the new_group API is used to create a new
process group from the default pg. When device_id is specified in
init_process_group and nccl is used as the backend, the new_group call
will use ncclCommSplit to create the nccl communicators to save
communicator resources. It has a few drawbacks:
Redundant calls
Suppose the default group has 256 ranks and we need 32 child PGs,
each with 8 ranks. In this case, each rank needs to call
new_group and ncclCommSplit 32 times because of how we implement the
new_group API and the collective requirement of ncclCommSplit. For a
specific global rank, 31 of those ncclCommSplit calls would be no_color splits,
and only 1 of them is a colored split. With the proposed new split_group
API, we expect only 1 call of split_group/ncclCommSplit to be needed per
rank in the above example case.
new_group can only split from default_pg
Ideally, a new pg should be able to be split from any pg
With the new split_group API, users can create new PGs using
ncclCommSplit with fewer calls and initialize the PG eagerly.
This is also useful in the cases of creating many P2P communicators.
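A minimal usage sketch for the example above (the `split_ranks` argument name and exact signature are assumptions based on this description, and the snippet assumes a torchrun launch with an eagerly initialized NCCL backend):
```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
dist.init_process_group("nccl", device_id=torch.device("cuda", local_rank))

# 256 ranks split into 32 child PGs of 8 ranks each, with a single
# split_group/ncclCommSplit call per rank (assumed API shape).
split_ranks = [list(range(i * 8, (i + 1) * 8)) for i in range(32)]
child_pg = dist.split_group(split_ranks=split_ranks)
```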
Test Plan:
New UTs:
e.g., python test/distributed/test_c10d_nccl.py -k
test_comm_split_group_larger_scale
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130507
Approved by: https://github.com/wconstab
Summary:
As a followup to https://github.com/pytorch/pytorch/pull/130454, users are hitting the cross-mesh operation error because the DeviceMesh thread ID differs between the saved and the loaded DTensor.
This is a hot fix to only consider the real thread_id in DeviceMesh hash under threaded backend, but set it to None for all other cases.
As a follow up, we need to look at the following test failures to better root cause specific DeviceMesh related failures related to MTPG, if thread_id is not included as part of the hash.
```
test/distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShardRegisteredParams::test_param_registration_after_forward
test/distributed/_tensor/test_dtensor_ops.py::TestDTensorOpsCPU::test_dtensor_op_db_column_stack_cpu_float32
```
Adding an additional is_initialized() check since APF has a test mocking the backend without pg initialized. Therefore, we need to add the is_initialized() check to avoid test failure. In real use case, we should have a pg initialized before the get_backend() check. Not sure if we want to add this specifically for the test, but temporarily adding it to unblock APF conveyor runs.
Test Plan:
```
[irisz@devgpu051.cln3 /data/users/irisz/fbsource/fbcode (38e4a0a3b)]$ buck2 test 'fbcode//mode/opt' fbcode//apf/distributed/tests:pipeline_parallel_test_cpu -- --exact 'apf/distributed/tests:pipeline_parallel_test_cpu - apf.distributed.tests.pipeline_parallel_test_cpu.PipelineParallelContextTestCPU: test_stage_pg_creation_with_different_backends'
```
Reviewed By: gag1jain
Differential Revision: D59725924
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130685
Approved by: https://github.com/gag1jain
0.12.0 Major Updates:
- Add context manager to temporarily set the dictionary sorting mode
- Add accessor APIs
- Use `stable` tag for `pybind11` for Python 3.13 support
- Fix potential segmentation fault for pickling support
0.12.1 Updates:
- Fix warning regression during import when launch with strict warning filters
Closes #130155
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139
Approved by: https://github.com/zou3519
Made the following changes:
- mutates_args is now keyword-only and mandatory. This is to align with
torch.library.custom_op (which makes it mandatory because it's easy to
miss)
- op_name is now keyword-only. This helps the readability of the API
- updated all usages of infer_schema
This change is not BC-breaking because we introduced
torch.library.infer_schema a couple of days ago.
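A small sketch of the updated call convention (the schema strings in the comments are indicative):
```python
import torch
from torch.library import infer_schema

def my_sin(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x)

# mutates_args is now keyword-only and mandatory; op_name is keyword-only too.
print(infer_schema(my_sin, mutates_args=()))                    # e.g. "(Tensor x) -> Tensor"
print(infer_schema(my_sin, op_name="my_sin", mutates_args=()))  # e.g. "my_sin(Tensor x) -> Tensor"
```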
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130705
Approved by: https://github.com/yushangdi
# Motivation
Before this PR, device construction returned the `cuda` type when only a device index was given (or the `PrivateUse1` type if a `PrivateUse1` backend is registered).
```bash
>>> import torch
>>> device = torch.device(0)
>>> device.type
'cuda'
>>> a = torch.tensor([1, 2])
>>> b = a.to(0)
>>> b
tensor([1, 2], device='cuda:0')
```
It works well on a CUDA GPU, but it raises unexpected information and an error when running on XPU.
```bash
>>> import torch
>>> device = torch.device(0)
>>> device.type
'cuda'
>>> a = torch.tensor([1, 2])
>>> b = a.to(0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/xxx/pytorch/torch/cuda/__init__.py", line 302, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
```
With this PR, we refine the logic to use the currently available device type instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129119
Approved by: https://github.com/albanD, https://github.com/gujinghui, https://github.com/EikanWang
ghstack dependencies: #129463, #129205, #129363
This is the initial version of an API to create custom operators whose
implementations are backed by triton kernels. While user-defined triton
kernels work out-of-the-box with triton kernels, you may wish to
construct a custom operator if you need to compose with other PyTorch
subsystems, like Tensor subclasses or vmap.
I'm hoping to get design feedback on this and ship it so that we can
begin experimenting with customers.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130637
Approved by: https://github.com/albanD
Reduces the guard overhead from 2.1k units to 1k units. Compared to no-inlining (0.4k units), this reduces the slowdown from 5x to 2.5x.
This introduces unsoundness, but only for hooks for inbuilt nn modules (user defined nn module hooks are fine).
Each builtin nn module adds 4 empty ordered dict checks in the check_fn. This blows up for models with large numbers of builtin nn modules. With this PR, we skip those guards. There is no other easy way I can think of right now to control the guard overhead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130420
Approved by: https://github.com/jansel
ghstack dependencies: #130654
**Summary**
Support more than one Local Buffer in an outer loop fused node, and also the case where multiple global buffers share usage of the same local buffer.
**TestPlan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_two_local_buffers_in_outer_loop_fusion
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_share_local_buffers_in_outer_loop_fusion
```
**Next Step**
- [✓] Support more than one Local Buffer/Global Buffer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129121
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #126967
Summary:
CUDAGraph Trees previously relied on the assumption that static inputs (parameters and buffers) do not change tensor addresses across multiple function invocations. This assumption can be used to reduce the number of tensor copies to improve performance. We also use `check_static_inputs_are_stable()` to check whether this assumption holds at runtime.
While this assumption is true in most cases, we recently observed a few cases where this assumption is not valid:
- [Inline inbuilt nn modules](https://github.com/pytorch/pytorch/pull/126822): the same function (a nn module) is used in multiple places and different parameters and buffers are passed to this function with different tensor addresses
- Some user code changes tensor addresses of parameters/buffers. See [internal example]( https://www.internalfb.com/mlhub/pipelines/runs/mast/sw-935450288-OfflineTraining_08ba1cf0?job_attempt=1&version=0&env=PRODUCTION)
- Compiled Autograd may also pass parameters/buffers with different tensor addresses across runs.
Previous PR [#126822](https://github.com/pytorch/pytorch/pull/126822) (by @mlazos) allows detecting static tensor address changes at runtime and re-recording a cudagraph if that happens. However, if the same function is re-recorded too many times, it may introduce large overhead and hurt performance. This PR adds `torch._inductor.config.triton.cudagraph_max_recording` (=5) to fall back to eager if a function has been recorded more than `cudagraph_max_recording` times for a specific node in the CUDAGraph Trees.
A summary on how static tensor address changes are handled now:
- For each child node, check the assumption via `check_invariants`. If this holds, execute node with the assumption.
- If the assumption does not hold for all child nodes, re-record if the function_id has not been recorded too many times for the current_node.
- If the function_id has been re-recorded too many times, fall back to the eager function and warn.
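A one-line sketch of the knob, using the name given above (the exact config path is taken from this description and may differ in later versions):
```python
import torch

# If a function's cudagraph gets re-recorded more than this many times for a
# given node in the CUDAGraph Trees, fall back to running it eagerly.
torch._inductor.config.triton.cudagraph_max_recording = 5
```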
Test Plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129349
Approved by: https://github.com/eellison
Summary: On the autograd side of things, we are currently saving the kwinputs but we aren't doing anything with them on the profiler side. This diff enables the use of the kwinputs for both FunctionEvents and Chrome Traces.
Test Plan: Added unit testing for both chrome traces and FunctionEvents. Used RecordFunctionFast to test kwinputs since test already had kwargs being passed in but not tested.
Differential Revision: D59472345
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130373
Approved by: https://github.com/davidberard98
Fixes#125224
For large ranges, calls to CUDA `randint` use a different `unroll_factor` to generate random ints. This `unroll_factor` was not considered correctly in the calculation of the Philox offsets. Thus, some of the random states were reused, resulting in lower entropy (see #125224).
This also affects multiple other random functions, such as `torch.rand` and `torch.randn`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126066
Approved by: https://github.com/eqy, https://github.com/lezcano
Preventing match across mutations should always be the safe thing to do. This will be especially important for Traceable FSDP2 because in that case we do have mutation ops (`.set_` and `.resize_(0)`) in the middle of the graph for both joint-graph and post-grad graph, so making sure the pattern matcher passes work well with middle-of-graph mutation ops is important.
Q: Why can't we move these mutation ops to the end of graph, to make pass writing easier?
A: We attempted to do that in https://github.com/pytorch/pytorch/pull/129852, but the custom FX passes (in `torch/_functorch/_aot_autograd/fx_passes.py`) for the re-functionalization is complicated to maintain, and the changes to partitioner (in `torch/_functorch/partitioners.py`) also feels hacky. Hence we want to preserve these mutation ops in the middle of graph to avoid the complexity.
Test commands:
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_uint4x2_mixed_mm`
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_serialized_patterns_up_to_date`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130584
Approved by: https://github.com/jansel
# Flex Decoding
tl;dr This PR adds `flex_decoding` kernel to higher-order-op: `flex_attention` as the backend for multi-head attention decoding.
Higher-order-op `flex_attention` was introduced in https://github.com/pytorch/pytorch/pull/121845 to accept a user-defined score modification callable (`score_mod`) and, through `torch.compile`, to create an efficient fused flash attention kernel instantiation. The `flex_attention` kernel is efficient for long-query (>512 tokens) attention. This PR introduces the `flex_decoding` kernel as an alternative backend for the `flex_attention` HOP to handle LLM inference, where short queries (<32 tokens) attend to long key/value sequences.
### Details
LLM decoding iteratively attends each newly generated token (query length = 1) to a long key/value context (up to 132k). The `flex_attention` kernel only parallelizes attention along the query length (M), batch size (B) and number of heads (H) dimensions. LLM decoding lacks enough parallelism in the M dimension to fill up all SMs on modern GPUs.
`flex_decoding` adds parallelization along the key/value sequence length (N). The key/value cache of a single head is split into multiple blocks and the query tokens attend to them in parallel. The results for the same head are then reduced across KV blocks to generate a global output.
## Examples
Consider a Group Query Attention (GQA) decoding case, where a query token with 16 query heads (Hq) attends to 2 kv heads (Hkv). Assume a batch size of 2 (B=2) and a kv cache length of 4096 (N=4096). The attention kernel iteratively attends the newly generated query token (Mq = 1) to the cache.
We transform this problem into a Multiheaded Attention (MHA) problem by assuming a query length equal to the number of query heads per kv head, i.e. M=Hq//Hkv.
The inputs to the `flex_attention` HOP are thus a query of shape (B=2, H=Hkv=2, M=Hq//Hkv=8, D=64) and key/value of shape (B=2, H=Hkv=2, N=4096, D=64), which lead to an intermediate attention score matrix of shape (2, 2, 8, 4096) and an output of shape (2, 2, 8, 64).
```Python
import torch
from torch.nn.attention._flex_attention import _flex_attention as flex_attention
torch.manual_seed(0)
# Let's create some input tensors
# query of shape (B, Hkv, Hq//Hkv, D)
# key/value of shape (B, Hkv, N, D)
query = torch.randn(2, 2, 8, 64, device="cuda", dtype=torch.float32)
key = torch.randn(2, 2, 4096, 64, device="cuda", dtype=torch.float32)
value = torch.randn(2, 2, 4096, 64, device="cuda", dtype=torch.float32)
# Let's create a new score modification, checkerboard.
def checkerboard(score, batch, head, token_q, token_kv):
    score = torch.where(torch.abs(token_kv - token_q) == 1, score * 0.5, score)
    score = torch.where(torch.abs(token_kv - token_q) == 2, score * 2.0, score)
    return score
# Let's call flex_attention with this new score modification for decoding.
# The flex_attention HOP will choose flex_decoding as its backend since our query length (M) is only 8.
output = flex_attention(query, key, value, score_mod=checkerboard)
compiled_flex_attention = torch.compile(flex_attention)
out_compiled = compiled_flex_attention(query, key, value, score_mod=checkerboard)
torch.testing.assert_close(output, out_compiled, atol=2e-2, rtol=2e-2)
```
## Future Plans
- This PR does not implement a load mask for the score_mod function. This means that if the score_mod function takes a captured buffer along the M dimension, it must be padded to a q length of 16, or to the next power of 2 of the query length if q_len > 16.
i.e.
```python
q_scale = torch.randn(Hq//Hkv, device="cuda")
q_scale = torch.nn.functional.pad(q_scale, (0, 16-Hq//Hkv)) # Pad captured buffer
def bias_mod(score, batch, head, q, kv):
    score = score + q_scale[q]  # index by the query position argument
    return score
```
- The backward path for short queries (<128 tokens) currently does not work because the `flex_attention_backward` kernel lacks mask support and only takes query lengths that are a multiple of 128.
- Dynamic shapes and max-autotune are currently not working
- Add block sparse mask support (#129216 is a draft for flex_attention kernel)
- Add explicit GQA support. (#130076 is a draft for GQA support on flex_attention kernel)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129415
Approved by: https://github.com/Chillee
Since the raise_comms and sink_waits passes are also scheduling-based, we can now implement reorder_compute_for_overlap as an optional step in the same pass. Merging them into the same pass greatly simplifies the logic and makes it easier to reason about the synergy between different passes.
- The unit tests are now fixed and re-enabled.
- Verified that the pass produces good scheduling w/ Llama3 70B in torchtitan (the scheduling was sub-optimal before this PR).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130573
Approved by: https://github.com/Chillee
ghstack dependencies: #129980
This involved beefing up the Python dispatcher to handle torch_dispatch.
Given a HOP and a torch_dispatch Tensor subclass:
- the HOP will show up in the subclass's `__torch_dispatch__`
- you can also use HOP.py_impl to register a rule for the HOP x
subclass interaction
- (coming soon) we'll offer a way to open register HOP x subclass
interaction without needing to touch the subclass's
`__torch_dispatch__` or the HOP's .py_impl.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130606
Approved by: https://github.com/ydwu4
Fixes#129403
Create a separate printing function to debug SymNode, since we can't easily change `__repr__` that is used by GraphModule.recompile() to create a pythonic version of a graph
This is my first contribution; please let me know if there is anything that I should look into in further detail.
Thank you for your guidance! 🙏 I hope to contribute more in the future!
@aorenste
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129925
Approved by: https://github.com/aorenste
Reland https://github.com/pytorch/pytorch/pull/128709.
When the input predicate is a python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior was that we baked True/False into the cond operator, which can be confusing. In this PR, we change it to specialize into one of the branches when the inputs are constants.
We additionally change the naming of the cond operator to the default one without overriding its name. This allows better testing on the de-serialized graph.
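A hedged sketch of the new behavior as described (the module, shapes, and branch functions are made up):
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        pred = x.shape[0] > 2  # static shape -> plain Python bool, not a tensor
        return torch.cond(pred, lambda t: t.sin(), lambda t: t.cos(), (x,))

# Export now specializes into the sin branch (and warns that dynamism is not
# preserved) instead of baking a constant predicate into a cond operator.
ep = torch.export.export(M(), (torch.randn(4, 3),))
print(ep.graph)
```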
Test Plan:
The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check that cond is specialized as one of the branches.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130493
Approved by: https://github.com/BoyuanFeng
Sometimes it can be difficult to write a fake class, e.g. when the original implementation uses some third-party libraries, or users are certain that the class is safe to trace with the real object.
This PR allows users to specify their intention by implementing a "safe_to_trace_with_real_obj" method on their script class.
Test Plan:
`pytest test/export/test_torchbind.py -k safe`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129586
Approved by: https://github.com/zou3519
Construct frame localsplus in 3.12+ using our own simplified way rather than copypasting from CPython.
This is necessary for 3.13 since we can no longer generate frame `f_locals` before executing the interpreter frame.
We also enable this for 3.12 since the `f_locals` construction between 3.12 and 3.13 is the same, so we can test for correctness with 3.12.
This is also one of the first steps to completing https://github.com/pytorch/pytorch/issues/93753 - we will implement simplified f_locals generation of previous Python versions in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129185
Approved by: https://github.com/jansel
Summary:
This PR changes two implementations to make the CP (CP8) loss curve be on par with TP (TP8).
1. Making key and value contiguous before doing ring attention. It is unclear why this is a requirement as SDPA does not have this requirement.
2. Use the out, grad_out, softmax_lse passed by autograd to do the backward. This implementation is similar to the implementation in transformer engine. The original implementation reruns the SDPA to get the output and logsumexp and uses those recalculated results to infer the corrected softmax_lse. But that implementation does not give better accuracy or a better loss curve; instead, it converges slower.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129515
Approved by: https://github.com/d4l3k, https://github.com/wanchaol
ghstack dependencies: #129512, #129514
Fixes the failure in `test/export/test_export_training_ir_to_run_decomp.py` caused by dead code elimination removing nodes with side effects.
For background, in export, we may want to export higher-level IRs that are not functional, so we need to check for side effects more carefully.
A call_function node is impure if it has at least one mutable argument.
Fixed the tests below:
test_to_module_with_mutated_buffer_multiple_update_sub_later
test_export_input_mutation_static_shape
test_buffer_util
Another attempt modifying the original DCE pass is made in PR #130395, but it breaks some other tests, so here we add a flag and use it for export only.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130552
Approved by: https://github.com/pianpwk
Workaround a bug in `reductionAndWithTensor:` that kills the app with the
following assert if a 5+D tensor is passed as an input
```
Assertion failed: (0 <= mpsAxis && mpsAxis < 4 && "Runtime canonicalization must simplify reduction axes to minor 4 dimensions."), function encodeNDArrayOp, file GPUReductionOps.mm, line 76.
```
by reshaping the tensor to 2D/3D one before running the reduction.
Refactored common code into `all_any_common_impl_mps` as both `reductionOrWithTensor:` and `reductionAndWithTensor:` suffer from the same issue
Enabled `test_reduction_ops_5D` and added regression test to it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130542
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #130541
Summary: If we attempt to precompile sets of different choices (e.g. Triton vs Cutlass) that have the same key, the cached pool of futures doesn't work, since it only includes the first set of configs. Add the config's hashes to the key to avoid this problem.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130350
Approved by: https://github.com/eellison
We add torch.library.Library._register_torch_dispatch_rule. Here, a user
can provide us a specific rule to run for a specific
(torch_dispatch_class, operator) pair. The motivation is that a user
might want to extend a subclass/mode but may not have access to the
source code of the subclass/mode.
I'll make this public in a follow-up PR if we think the approach and API
is good.
Keep in mind that many subclasses will likely deliver their own open
registration solution (DTensor has register_sharding_prop_rule and NJT
has register_jagged_op); _register_torch_dispatch_rule is meant as a
catch-all open registration mechanism for when the subclass hasn't
provided anything more specific.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130064
Approved by: https://github.com/albanD
Summary:
Integrate scope tracking with `checkpoint/fb/logging_handlers.py`.
Add a map of uuid -> tracker context manager. When the logging handler sees the following events:
* `start`: create scope_tracker object, call `__enter__`, add to map with uuid
* `end`: retrieve scope_tracker object by uuid, call `__exit__`.
* `exception`: retrieve scope_tracker object by uuid, call `__exit__` with current exception info.
Test Plan:
Test with bento notebook (attached).
with a runtime_error in finish_checkpoint method.
scuba records:
https://fburl.com/scuba/workflow_signpost/ddttgmv2
Differential Revision: D56654417
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130452
Approved by: https://github.com/LucasLLC
Summary: `quantization_tag` is first-class metadata in quantization flows and is preserved by them. As we'll want to store the quantized exported graphs, we also need to preserve this metadata as it's used in later flows. Only JSON-supported metadata will be allowed to be serialized.
Test Plan: Added test case
Differential Revision: D57939282
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127473
Approved by: https://github.com/angelayi
Summary:
full context in D59385876
Based on the offline discussion with PT2 folks, we opted to change the SDPA impl to mitigate the AOTI lowering issue
Test Plan: PYTORCH_TEST_FBCODE=1 buck2 run mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true caffe2/test/inductor:test_inductor -- -r test_sdpa_inference_mode_aot_compile
Differential Revision: D59495634
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130281
Approved by: https://github.com/drisspg, https://github.com/zou3519, https://github.com/Skylion007, https://github.com/justinchuby
When running some test cases on the privateuse1 device, the device suffix in a test name is sometimes 'privateuse1'.
For example, a test name should be 'test_Dropout1d_npu', but it sometimes shows up as 'test_Dropout1d_privateuse1'.
This happens when setUpClass() doesn't set the device suffix, in which case it falls back to "privateuse1".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130091
Approved by: https://github.com/zou3519
The tests for `raise_comms` and `sink_waits` passes were not enabled in CI. The passes are now broken due to functional collective v2 and possibly other changes.
Correctness issues:
- The original passes did not take mutation into consideration and may yield semantically different scheduling order. This may be due to the recent changes to how mutations are expressed in Inductor IR (e.g., MutationOutput).
Effectiveness issues:
- The original passes only moved the comm/wait nodes themselves. However, comm nodes can come with prologues (e.g., clone for all_reduce_, split-cat for non-zero dim all-gather). Whenever there are any prologues, the comms won't be raised at all.
- The prologues are often horizontally fused with other pointwise nodes. This can severely delay the scheduling of the comm node.
This PR:
- Make the passes handle mutation correctly.
- Instead of moving individual comm/wait nodes, schedule all node using a scored method. This way the comm nodes can be optimally raised even in the presence of prologues.
- The horizontal fusion of prologues often severely delays the scheduling of the comm node. Horizontally fusing this clone can almost never out-perform scheduling the comm node earlier. Also, in most cases this clone is eliminated via in-place reuse. Therefore, we tell the scheduler to not fuse it.
- Enable the tests in CI.
Co-authored-by: Will Feng <yf225@cornell.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129980
Approved by: https://github.com/yf225
As discussed with @mlazos and @Chillee in the Inductor group chat, we need the concept of `GroupedSchedulerNode` to be able to express nodes that must be scheduled together one-after-another (i.e. no other node is allowed to fuse into them or schedule in-between them).
This is particularly important for comm reordering and fine-grained control of peak memory. For Traceable FSDP2, there are two very important requirements:
- At any time, there must be only one AllGather in flight. However, our existing comm reordering pass will naturally raise **all** of AllGather ops to the beginning of the graph, which will clearly blow up memory usage. Instead, we leverage GroupedScheduleNode which provides simple connection points to build the "chaining" on. i.e. we use it to express the schedule `(copyin + AllGather1) -> (AllGather1Wait+copyout) -> (copyin + AllGather2) -> (AllGather2Wait+copyout) ...` by setting up fake dep between the GroupedScheduleNode, which is a very clean and easy-to-understand way to express this schedule.
- The "comms" in FSDP2 are not just comms, but a combination of compute and comm. We must prevent other nodes from being scheduled in-between that set of nodes, otherwise we are artificially delaying the release of comm buffer memory which makes the peak memory usage quite bad. This is particularly pronounced for `AllGatherWait+copyout`.
From these two requirements, we derive the behavior of `GroupedSchedulerNode`: it contains nodes that must be scheduled together one-after-another (i.e. no other node is allowed to fuse into them or schedule in-between them).
----
Q: Can we leverage `ir.Subgraph`?
A: I looked into the possibility of using `ir.Subgraph` to implement this, but realized that:
1. `ir.Subgraph` requires defining the subgraph in FX IR.
2. There is no guarantee that the Inductor IR nodes that we want to group together will all have a corresponding FX IR node, because some of those Inductor IR nodes can potentially be dynamically generated by a custom pass in the scheduler (e.g. for merging multiple all-gathers into one big all-gather, and later we want to group that big all-gather with some other op). Dynamically generated Inductor IR node doesn't have a corresponding upstream FX IR node.
3. For the above reasons, we can't use the `ir.Subgraph`, and need to define a new (and more lightweight) concept of `GroupedSchedulerNode` to achieve the behavior we need (this PR).
----
Test commands:
- `pytest -rA test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc::test_grouped_scheduler_node`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128568
Approved by: https://github.com/eellison, https://github.com/mlazos
Summary: Users have been confused about why user annotations on GPU tracks do not show up when doing GPU-only tracing. This comment should help users understand that to use this function they need to have CPU activities enabled.
Test Plan: N/A it is just updating a comment
Differential Revision: D59649390
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130561
Approved by: https://github.com/aaronenyeshi
Before this PR, custom ops that don't return outputs would get eliminated after calling `.module()` because the effect_token that keeps the operator alive is removed in the remove_effect_token pass. The reason we want to remove_effect_token is that we don't want the token to be part of the input. However, this causes the DCE calls in remove_effect_token itself and the DCE calls in unlift to remove the custom op from the graph, causing an error in the exported graph.
This PR calls has_side_effect in with_effect to make sure graph.eliminate_dead_code doesn't remove the calls by accident.
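For illustration, a rough standalone sketch of the FX mechanism involved (the custom op and namespace below are hypothetical, not the ones from this PR):
```python
import torch
from torch.fx.node import has_side_effect

lib = torch.library.Library("mylib", "DEF")
lib.define("record(Tensor x) -> ()")

# Mark the op as side-effectful so graph.eliminate_dead_code() keeps the call
# even though its (empty) output is never used.
has_side_effect(torch.ops.mylib.record.default)
```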
Test Plan:
Add a new test pytest test/export/test_torchbind.py -k test_export_inplace_custom_op
Differential Revision: [D59498728](https://our.internmc.facebook.com/intern/diff/D59498728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129680
Approved by: https://github.com/angelayi
Take the intersection of all the tags for the corresponding aten op overloads. Previously, some of the rng ops not having tags caused issues with constant folding (they should get decomposed, but that's a separate issue).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130367
Approved by: https://github.com/ezyang
Summary: This diff updates the ExportedProgram class in PyTorch to allow for multiple verifiers to be attached to it. This is done by adding a new field to the ExportedProgram schema called "verifiers" which is a list of strings representing the names of the verifiers to be attached to the program. The verifiers are loaded using the "load_verifier" function which is defined in the "torch._export.serde.serialize" module. The "exported_program.dialect" field is also deprecated in favor of the "verifiers" field.
Test Plan: CI
Differential Revision: D59408546
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130364
Approved by: https://github.com/angelayi, https://github.com/ydwu4
Fixes#129865
Currently, new_group will call ncclCommSplit in some cases. In theory, ncclCommSplit brings performance and memory benefits. However, the config parameter of the ncclCommSplit function in PyTorch does not set "splitShare=1", which results in the optimization of ncclCommSplit being turned off and the benefits being lost.
This PR turns on splitShare=1 to make the comm_split optimization effective.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129929
Approved by: https://github.com/shuqiangzhang
Summary: Previously, the remove_effect_tokens pass didn't pass kwargs to the internal nodes. This PR fixes that and adds a test for it.
Test Plan: buck2 run caffe2/test:test_export -- -r test_remove_effect_token_kwargs
Reviewed By: angelayi
Differential Revision: D59603147
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130491
Approved by: https://github.com/angelayi
As a followup to https://github.com/pytorch/pytorch/pull/130454, users are hitting the cross-mesh operation error because the DeviceMesh thread ID differs between the saved and the loaded DTensor.
This is a hot fix to only consider the real thread_id in DeviceMesh hash under threaded backend, but set it to None for all other cases.
As a follow up, we need to look at the following test failures to better root cause specific DeviceMesh related failures related to MTPG, if thread_id is not included as part of the hash.
```
test/distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShardRegisteredParams::test_param_registration_after_forward
test/distributed/_tensor/test_dtensor_ops.py::TestDTensorOpsCPU::test_dtensor_op_db_column_stack_cpu_float32
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130495
Approved by: https://github.com/awgu, https://github.com/wanchaol
This PR fixes profiler/test_profiler.py::TestProfiler::test_oom_tracing.
The test expects an OOM when allocating a huge tensor, but MI300X has enough memory to allocate such a tensor.
This PR increases the tensor size by a large margin to force an OutOfMemory exception on MI300X and future GPU generations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130334
Approved by: https://github.com/jeffdaily, https://github.com/janeyx99
Also move `using namespace mps` outside of every function, as there is no
need to repeat it.
Use `getTensorsStringKey` instead of explicit
`getMPSShapeString(getMPSShape(t)) + getMPSDataTypeString(t)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130541
Approved by: https://github.com/Skylion007
Fixes https://github.com/pytorch/pytorch/issues/130456
When we mark_unbacked a size, we actually DO have a hint for it
(because we have a real input tensor), and previously we were
accidentally putting it into the hint field of SymNode. If the marked
unbacked size is zero or one, this can lead to inconsistency between
hint compute and static evaluation compute under guard size oblivious,
since that's the whole point of size oblivious. The answer is to scrub out
hints on mark_unbacked ints.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130483
Approved by: https://github.com/lezcano
This PR makes it so that all tensors are reduced to their metadata in AOTAutogradCache. Because dynamo always embeds constant tensors into the FXgraph directly, there's no risk of a constant tensor whose values are semantically important being lost here. AOTAutograd itself may take a constant tensor and set it as an attribute on an FXGraph for inductor, but Dynamo never does this.
One other thing that this diff does is add `[pickler.fast](https://docs.python.org/3/library/pickle.html#pickle.Pickler.fast)` to our pickling algorithm for cache key generation. Pickle will often memoize/intern strings when pickling, leading to false cache misses due to inconsistent memoization. Turning on pickler.fast removes this behavior.
Technically `fast` is a "deprecated" feature according to python docs. But it's still supported in py3.8-3.12, and if it ever is removed, the only downside will just be a few more cache misses, so I think it's worth just adding here (and removing later as needed)
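A small sketch of the `pickler.fast` idea for cache-key generation (illustrative only, not the actual AOTAutogradCache code):
```python
import io
import pickle

def stable_dumps(obj):
    buf = io.BytesIO()
    p = pickle.Pickler(buf, protocol=pickle.HIGHEST_PROTOCOL)
    p.fast = True  # disable the memo so equal inputs always serialize to equal bytes
    p.dump(obj)
    return buf.getvalue()

# identical metadata now always hashes to the same cache-key bytes
key = stable_dumps({"op": "aten.add.Tensor", "shape": (8, 8)})
```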
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128583
Approved by: https://github.com/oulgen
ghstack dependencies: #128335
This PR updates the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8` for CPU, CUDA, and MPS, which helps decouple the int4 model checkpoint from specific ISAs and platforms in `gpt-fast`. The advantage is that an int4 model checkpoint can be shared across different test machines without re-generating it on one particular platform. Meanwhile, the size of the input `weight` is reduced to `1 / 8`.
Before this PR, the packed weight was stored in a CUDA-specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. The CPU packed weight was viewed as the SAME shape but stored in a different layout: `[n/64][k][32]`, dtype uint8. The weight was strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar), and users could not use a weight generated on one ISA or platform on a different one, because the compute format differs when loading the weight onto a device.

Now, we use a common serialized layout (`[n][k/2] uint8`) for different devices and ISAs as the input `weight` of `_convert_weight_to_int4pack`, and each backend chooses how to interpret it as its compute layout.
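As an illustration of the serialized layout (not the actual packing code; the nibble order shown here is an assumption):
```python
import torch

n, k = 64, 128
w_int4 = torch.randint(0, 16, (n, k), dtype=torch.uint8)  # one int4 value per element
# pack two int4 values into each byte along k -> shape [n, k // 2], dtype uint8
packed = (w_int4[:, ::2] << 4) | w_int4[:, 1::2]
assert packed.shape == (n, k // 2)
```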

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
There is no obvious regression of this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
In particular, when creating the PyTorch wheel, we use setuptools find_packages 551b3c6dca/setup.py (L1055) which explicitly skips packages without `__init__.py` files (namespace packages) https://setuptools.pypa.io/en/latest/userguide/package_discovery.html#finding-simple-packages.
So this PR is reverting the change to stop skipping these namespace packages as, even though they are in the codebase, they are not in the published binaries and so we're ok relaxing the public API and importability rules for them.
A manual diff of the two traversal methods:
```
torch._inductor.kernel.bmm
torch._inductor.kernel.conv
torch._inductor.kernel.flex_attention
torch._inductor.kernel.mm
torch._inductor.kernel.mm_common
torch._inductor.kernel.mm_plus_mm
torch._inductor.kernel.unpack_mixed_mm
torch._strobelight.examples.cli_function_profiler_example
torch._strobelight.examples.compile_time_profile_example
torch.ao.pruning._experimental.data_sparsifier.benchmarks.dlrm_utils
torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_disk_savings
torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_forward_time
torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_model_metrics
torch.ao.pruning._experimental.data_sparsifier.lightning.tests.test_callbacks
torch.ao.quantization.experimental.APoT_tensor
torch.ao.quantization.experimental.adaround_fake_quantize
torch.ao.quantization.experimental.adaround_loss
torch.ao.quantization.experimental.adaround_optimization
torch.ao.quantization.experimental.apot_utils
torch.ao.quantization.experimental.fake_quantize
torch.ao.quantization.experimental.fake_quantize_function
torch.ao.quantization.experimental.linear
torch.ao.quantization.experimental.observer
torch.ao.quantization.experimental.qconfig
torch.ao.quantization.experimental.quantizer
torch.csrc.jit.tensorexpr.codegen_external
torch.csrc.jit.tensorexpr.scripts.bisect
torch.csrc.lazy.test_mnist
torch.distributed._tensor.examples.checkpoint_example
torch.distributed._tensor.examples.comm_mode_features_example
torch.distributed._tensor.examples.comm_mode_features_example_argparser
torch.distributed._tensor.examples.convnext_example
torch.distributed._tensor.examples.torchrec_sharding_example
torch.distributed._tensor.examples.visualize_sharding_example
torch.distributed.benchmarks.benchmark_ddp_rpc
torch.distributed.checkpoint.examples.async_checkpointing_example
torch.distributed.checkpoint.examples.fsdp_checkpoint_example
torch.distributed.checkpoint.examples.stateful_example
torch.distributed.examples.memory_tracker_example
torch.fx.experimental.shape_inference.infer_shape
torch.fx.experimental.shape_inference.infer_symbol_values
torch.include.fp16.avx
torch.include.fp16.avx2
torch.onnx._internal.fx.analysis.unsupported_nodes
torch.onnx._internal.fx.passes._utils
torch.onnx._internal.fx.passes.decomp
torch.onnx._internal.fx.passes.functionalization
torch.onnx._internal.fx.passes.modularization
torch.onnx._internal.fx.passes.readability
torch.onnx._internal.fx.passes.type_promotion
torch.onnx._internal.fx.passes.virtualization
torch.utils._strobelight.examples.cli_function_profiler_example
torch.utils.benchmark.examples.sparse.compare
torch.utils.benchmark.examples.sparse.fuzzer
torch.utils.benchmark.examples.sparse.op_benchmark
torch.utils.tensorboard._convert_np
torch.utils.tensorboard._embedding
torch.utils.tensorboard._onnx_graph
torch.utils.tensorboard._proto_graph
torch.utils.tensorboard._pytorch_graph
torch.utils.tensorboard._utils
torch.utils.tensorboard.summary
torch.utils.tensorboard.writer
```
These are all either namespace packages (which we want to remove) or packages that are not importable (and tagged as such in the test).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130497
Approved by: https://github.com/aorenste
In this PR, we abstracted the different types of aten operation parameters as `ParameterMetadata`. This structure is intended to represent and store the metadata of each aten operation parameter. Currently, it only supports `Tensor`, `TensorList`, and `Scalar`.
```C++
using ParameterMetadataValue = std::variant<TensorMetadata, std::vector<TensorMetadata>, c10::Scalar>;
```
With this PR, we can extend support to other parameter types in a more modular way, like `string`, `int`, and `double`.
Differential Revision: [D59399546](https://our.internmc.facebook.com/intern/diff/D59399546)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125308
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/atalman
This PR adds support to use expandable segments with private memory pools which should unblock using it with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks.
The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work.
Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together.
The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned back to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments are similar to those for normal blocks and do all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda.
With Cuda Graph Trees and private memory pools, we need the ability to take checkpoints of the current state of the memory allocator after each graph capture as well as reapplying the state before capturing a new graph after replaying a captured graph so that the new cuda graph capture has access to the state of the allocator at the point after replaying a previously captured graph so it can reuse empty blocks and allocate new ones.
As mentioned in a comment below, memory in a private pool is cached until the private pool is destroyed and allocations can only grow from extra graph captures; any freeing of memory would result in invalid memory addresses and would break cuda graphs.
One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are tracked in a member variable `unmapped` of a `BlockPool`. `unmapped` is *not* part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints, since we never free/unmap memory back to cuda; it is persisted across graph captures / replays.
Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint.
Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks.
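A minimal sketch of the combination this PR enables, assuming the documented `expandable_segments` allocator setting (illustrative only):
```python
import os
# must be set before CUDA is initialized
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

x = torch.zeros(16, device="cuda")
g = torch.cuda.CUDAGraph()
# graph capture allocates from a private memory pool, which with this PR may
# itself be backed by an expandable segment
with torch.cuda.graph(g):
    y = x * 2
g.replay()
```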
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068
Approved by: https://github.com/zdevito, https://github.com/eqy
When I played with DCP for distributed inference, I found that we are still using deprecated APIs for DCP even in unit tests. So this PR switches to the new API with the unified lowercase alias "dcp".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130475
Approved by: https://github.com/wz337
This is kind of a short-sighted workaround and we should actually come
up with a way to fix this in general, but I got annoyed that I can't use
-k to filter tests in test_schedule, and realized it's because we jam
tests using the new MultiProcContinuousTest fixture together with
old-style tests.
For now I separate the two types of tests so -k works again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130294
Approved by: https://github.com/H-Huang
Summary:
* Added support for preserving `_numeric_debug_handle` during deepcopy; we need to remap the args since `_numeric_debug_handle` refers to the nodes in the graph.
TODO: need to fully support re-export; currently the metadata for the output node is not preserved.
Test Plan:
python test/test_quantization.py -k test_deepcopy_preserve_handle
python test/test_quantization.py -k test_copy_preserve_handle
all related tests:
python test/test_quantization.py -k TestGenerateNumericDebugHandle
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129287
Approved by: https://github.com/zhxchen17
**Summary**
When checking the vectorization status across 3 test suites, we found that some operators disabled vectorization with the message `Disabled vectorization: op: remainder`. In this PR, we add vectorization support for this op.
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849
Approved by: https://github.com/jgong5, https://github.com/lezcano
ghstack dependencies: #130405
`max_norm=True` is currently written in the note, but `max_norm` can be a `float`, NOT a `bool` (as the [docstring](ec284d3a74/torch/nn/modules/sparse.py (L30)) says).
That note was created in #45595
The current pull request cleans it up.
The value `True` in the note can mislead users into thinking `max_norm` can be a boolean.
In fact, counter-intuitive behavior occurs if users try to set it to `False`:
it will be interpreted as 0, so the values of the embedding will become 0 - not what users expect when setting it to `False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129687
Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet
Summary:
If triton is available, but we can't import triton.compiler.compiler.triton_key, then we see some annoying behavior:
1) If we don't actually need to compile triton, the subprocess pool will still spew error messages about the import failure; it's unclear to users if this is an actual problem.
2) If we do need to compile triton, we a) see the error messages from above and b) get a vanilla import exception without the helpful "RuntimeError: Cannot find a working triton installation ..."
Test Plan: Ran with and without torch.compile for a) recent version of triton, b) triton 2.2, and c) no triton. In all cases, verified expected output (success or meaningful error message)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130403
Approved by: https://github.com/eellison
original PR: https://github.com/pytorch/pytorch/pull/128599 (re-created after revert + poisoned diff train)
Summary:
This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0) # 2*s0
w = z.repeat(y.shape[0]) # 2*s0*s1
_w = w.shape[0]
# something with _w ...
# turns into ->
s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```
Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)
# turns into
torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```
Test Plan:
contbuild & OSS CI, see 940e4477ab
Original Phabricator Test Plan:
Imported from GitHub, without a `Test Plan:` line.
Differential Revision: D59543603
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130380
Approved by: https://github.com/izaitsevfb
- Fix C++20 forward compatibility warnings, namely
```
warning: use of function template name with no prior declaration in function call with explicit template arguments is a C++20 extension [-Wc++20-extensions]
multi_tensor_apply_for_fused_optimizer<2, 512>(kernel_name,
```
- Use nested namespaces
- Do not explicitly specify `at::` namespace for functions already implemented inside of that namespace
- Use more convenience methods (rather than call by hand)
- Use C++14 `return f();` for void functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130453
Approved by: https://github.com/Skylion007
Reland of: https://github.com/pytorch/pytorch/pull/128016
Summary from previous PR:
We assume only two possible, mutually exclusive scenarios:
1. Running the compiled region for training (any of the inputs has requires_grad): produced differentiable outputs should have requires_grad.
2. Running the compiled region for inference (none of the inputs has requires_grad): none of the outputs have requires_grad.
Even if the user runs the region under no_grad() but has an input Tensor with requires_grad, we go with training scenario (1).
With the current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only whether any of the inputs requires_grad
2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under `torch.enable_grad()`
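A minimal sketch of the rule in 1/ above (illustrative; the real check lives inside AOTAutograd):
```python
import torch

def needs_autograd(flat_args):
    # Decided purely by the inputs; torch.is_grad_enabled() is intentionally
    # not consulted, per scenario (1) vs. (2) above.
    return any(isinstance(a, torch.Tensor) and a.requires_grad for a in flat_args)
```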
Changes in the partitioner?
Inference and training graphs differed in their return container (list vs. tuple).
The changes in the partitioner unify this so that a tuple is always returned.
As a result, there are some changes in test_aotdispatch.py for graph contents (list -> tuple).
Why was it reverted?
There was a regression of the hf_Reformer model on inference.
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```
One of the compiled graphs contained outputs that are aliases of inputs that are nn.Parameter(requires_grad=True).
Even though the torchbench inference benchmark runs inside `torch.no_grad()`, alias ops (specifically, for hf_Reformer, expand) preserve requires_grad.
As a result, we started compiling a training graph instead of an inference graph.
Fix for view ops:
If we have outputs that are aliases of inputs that require grad, those outputs requiring grad is not a reason to generate a training graph.
This is handled in aot_autograd.py, where output_and_mutation_safe is calculated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh
Summary:
Currently we have an issue where CPU User annotations can overlap with python events in the event that a python event calls step() within the function itself. To combat this, we can move the left side of the user annotation to the beginning of the parent python function. We do this because when instantiating the profiler we already start on step 0.
To implement this, we start by collecting all instances of ProfilerStep during post processing. Since TorchOps and Python events are sorted already, we can easily check if the current python event partially overlaps with the current ProfilerStep and, if so, alter the start time of the current ProfilerStep. We then move to the next ProfilerStep and continue iterating through all the python events. This keeps the time complexity of adding events to 'out' at O(s + n) -> O(n) post sorting, where "s" is the number of ProfilerSteps and "n" is the length of all events.
Test Plan:
Added unit test in which step() is called midway through a function. Afterwards, we print out a trace and then load the json to check that there are no overlaps. Also make sure that there is no regression in performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129599
Approved by: https://github.com/aaronenyeshi
When the input predicate is a python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior was that we baked a constant True/False into the cond operator, which can be confusing. In this PR, we change it to specialize into one of the branches when the predicate is a constant.
We additionally change the naming of the cond operator to the default one, without overriding its name. This allows better testing on the de-serialized graph.
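A small sketch of the behavior being described, assuming the public `torch.cond` signature (pred, true_fn, false_fn, operands); details such as the exact warning are illustrative:
```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

class M(torch.nn.Module):
    def forward(self, x):
        # x.shape[0] is static here, so the predicate is a plain Python bool and
        # export specializes to true_fn (with a warning) instead of baking a
        # constant True into the cond op.
        return torch.cond(x.shape[0] > 2, true_fn, false_fn, (x,))

ep = torch.export.export(M(), (torch.randn(4, 3),))
```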
Test Plan:
The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check that cond is specialized to one of the branches.
Differential Revision: [D59589709](https://our.internmc.facebook.com/intern/diff/D59589709)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128709
Approved by: https://github.com/zou3519
This fixes an MSVC build regression introduced by https://github.com/pytorch/pytorch/pull/129710, as VC++ does not accept preprocessor conditionals nested inside a macro invocation and fails with
```
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\int4mm.cu(984): error: "#" not expected here
do { const cudaError_t __err = cudaFuncGetAttributes( &funcAttr, #if defined(USE_ROCM) (void *)func #else func #endif ); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "C:\\actions-runner\\_work\\pytorch\\pytorch\\builder\\windows\\pytorch\\aten\\src\\ATen\\native\\cuda\\int4mm.cu", __func__, static_cast<uint32_t>(991), true); } while (0);
```
Fixes https://github.com/pytorch/pytorch/issues/130437
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130441
Approved by: https://github.com/Skylion007, https://github.com/malfet
Fixes the example in #118304 for `torch._functorch.aot_autograd.aot_export_module` and `torch.export.export`.
On a high level, the issue is caused by not detecting fake_mode when there's no input.
Change plan:
1) we add a `dynamic_shapes: Union[bool, None] = None` arg to `aot_export_module` and `_aot_export_function`.
2) if the input is not a graph module, then we can only rely on this `dynamic_shapes` input arg.
3) If the input is a graph module, then we can traverse the graph and check.
4) So we check if the input mod is a graph module or just a module, and do 2) or 3) depending on the type.
Fixes #129927
Bug source: dynamo's fake_mode is not detected correctly in `_convert_input_to_fake` in `_traced.py` when there's no input to the graph. So in `_strict_export_lower_to_aten_ir`, we create another fake_mode; this `dynamo_fake_mode` is not the same as the fake_mode used by dynamo.
Change plan:
check `gm_torch_level` graph's node meta "example_value" for fake mode in addition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129928
Approved by: https://github.com/angelayi
The needle has moved quite a bit on the ROCm backend front. This PR examines the tests referenced in the following issue: https://github.com/pytorch/pytorch/issues/96560
This is a follow-up PR to https://github.com/pytorch/pytorch/pull/125069,
unskipping the next batch of tests referenced by the aforementioned issue. No explicit source changes were needed, as the tests worked immediately after unskipping.
The tests previously marked with xfail have now been modified to no longer expect a failure when running on ROCm, as they now pass. Behavior is unchanged for them on other architectures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127966
Approved by: https://github.com/malfet
`aten._to_copy` can receive a python number as input. This occurs in
torch.compile support for vmap (see #130188). Previously, this would
raise an assertion error. This PR changes it so that if we see a python
number, we call torch.scalar_tensor on it first (h/t @bdhirsh).
Fixes #130362. Fixes #130188.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130381
Approved by: https://github.com/Chillee
Summary:
1. Fixed #130201 by adding type promotion.
2. Added proper tests.
3. Found torch's type promotion is different from numpy as follows:
```python
import torch
import numpy as np
np.clip(np.array([1], dtype=np.float32), np.array([1], dtype=np.int32), None).dtype # dtype('float64')
torch.clamp(torch.tensor([1], dtype=torch.float32), torch.tensor([1], dtype=torch.int32)).dtype # torch.float32
```
~Not sure of the proper way to handle it; it causes numpy ref tests to fail.~
The reason is here, so I think I'm gonna xfail it:
3c1cf03fde/test/test_ops.py (L260-L264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130226
Approved by: https://github.com/malfet
Fix static `py::object`s with `py::gil_safe_call_once_and_store`.
The following code holds a static `py::object` whose destructor runs when the program shuts down. The destructor calls `Py_DECREF(obj.m_ptr)`, which may raise a segmentation fault.
```c++
void func() {
  static py::object obj = py::module_::import("foo").attr("bar");
  ...
}
```
The correct code uses a raw pointer (which is intentionally never deleted) rather than a static instance.
```c++
void func() {
  static py::object* obj_ptr = new py::object{py::module_::import("foo").attr("bar")};
  py::object obj = *obj_ptr;
  ...
}
```
This PR uses `py::gil_safe_call_once_and_store` from `pybind11`, which can run arbitrary initialization code exactly once, thread-safely, under the Python GIL.
```c++
void func() {
  PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<py::object> storage;
  py::object obj = storage
      .call_once_and_store_result(
          []() -> py::object {
            return py::module_::import("foo").attr("bar");
          }
      )
      .get_stored();
  ...
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130341
Approved by: https://github.com/ezyang
When applied to a triton kernel, capture_triton allows the triton kernel
to be captured when tracing with make_fx. It does this by transforming the
call to the triton kernel into a call to the
triton_kernel_wrapper_mutation HOP, which can actually be traced into a
graph via make_fx.
We have two main use cases for this:
- non-strict export doesn't use Dynamo, but people want to use
non-strict export to export programs with triton kernels.
non-strict export uses make_fx tracing, so this is a necessary step in
that direction.
- People want to write inductor passes that replace a sequence of
operators with a call to a function that may contain a triton kernel.
The way these passes work today is that we have a FX graph and want to
replace a subgraph of it with a new subgraph. We obtain said subgraph
from calling make_fx on the function; this won't work on raw triton
kernels but will work if one uses capture_triton.
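Putting the pieces together, a hedged sketch of what a capture_triton-wrapped kernel call might look like under make_fx (the import path and exact wrapper behavior are assumptions; only the decorator name comes from this PR):
```python
import torch
import triton
import triton.language as tl
from torch._library import capture_triton  # assumed import location
from torch.fx.experimental.proxy_tensor import make_fx

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    tl.store(out_ptr + offsets,
             tl.load(x_ptr + offsets, mask=mask) + tl.load(y_ptr + offsets, mask=mask),
             mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    # the wrapped call is recorded as a triton_kernel_wrapper_mutation HOP
    capture_triton(add_kernel)[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

gm = make_fx(add, tracing_mode="fake")(
    torch.randn(8, device="cuda"), torch.randn(8, device="cuda")
)
```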
Test Plan:
- I wrote some manual tests to run make_fx over two of the triton
kernels in test_triton_kernels. It would be nice to be able to run
make_fx through all of the tests in the file but I'm not sure how to
do that refactor right now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130178
Approved by: https://github.com/oulgen
ghstack dependencies: #130177
TritonKernelVariable's logic tells us how to go from a user-defined
triton kernel and a grid to a call to the triton_kernel_wrapper_mutation
HOP. We want to re-use this in a setting without Dynamo; in the next PR
up, we create a new decorator (capture_triton) that, when applied to a
triton kernel, transforms a call to the triton kernel into a call
to the triton_kernel_wrapper_mutation HOP.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130177
Approved by: https://github.com/oulgen, https://github.com/ydwu4
Summary:
When writing out Graphviz files for graphs, sometimes the arguments are all
in a row and it's unclear which is which. Like for `aten.conv2d`, someone might not
remember the stride, padding, dilation order.
Add an option `normalize_args` (defaults to False) to normalize all args into kwargs.
This should help the readability of a graph.
Differential Revision: D59529417
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130348
Approved by: https://github.com/mcremon-meta
Summary: This diff fixes a bug where record_annotations would save a TraceEntry to each of the device_traces. Instead, we should only save annotations to the device_trace of the device used by the thread calling the native allocator's recordAnnotation.
Test Plan: CI and ran workloads on MVAI WPR FBR.
Reviewed By: zdevito
Differential Revision: D59477339
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130315
Approved by: https://github.com/zdevito
We add torch.library.Library._register_torch_dispatch_rule. Here, a user
can provide us a specific rule to run for a specific
(torch_dispatch_class, operator) pair. The motivation is that a user
might want to extend a subclass/mode but may not have access to the
source code of the subclass/mode.
I'll make this public in a follow-up PR if we think the approach and API
is good.
Keep in mind that many subclasses will likely deliver their own open
registration solution (DTensor has register_sharding_prop_rule and NJT
has register_jagged_op); _register_torch_dispatch_rule is meant as a
catch-all open registration mechanism for when the subclass hasn't
provided anything more specific.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130064
Approved by: https://github.com/albanD
- Add AMD support for int4 kernel
- Only supports CDNA2 and CDNA3 gpus for now
- Uses `mfma_f32_16x16x16bf16` instruction for matrix multiply
- Uses `v_and_or_b32` instruction and `__hfma2` instrinsic for unpacking bf16 values
- Enable hipify for `__nv_bfloat16` and `__nv_bfloat162` data types
- Enable int4 unit tests for CDNA2 and CDNA3 AMD gpus
- Fix torchscript issues due to hipify for `__nv_bfloat16` type
- TorchScript has its own implementation for bfloat16 type
- Implemented in the `__nv_bfloat16` structure at [resource_strings.h](https://github.com/pytorch/pytorch/blob/main/torch/csrc/jit/codegen/fuser/cuda/resource_strings.h)
- So, we shouldn't hipify any reference of `__nv_bfloat16` in the torchscript implementation
- Hence moved the `__nv_bfloat16` direct references in `codegen.cpp` and `cuda_codegen.cpp` to `resource_strings.h` which is already exempted from hipify
Fixes#124699
Fixes pytorch-labs/gpt-fast/issues/154
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710
Approved by: https://github.com/malfet
Summary: Previously, when we inlined subgraphs that don't have a different requires_grad environment, we didn't clean up the nodes' users in the subgraph and directly used them to replace the outputs of the call_modules. This recorded dead dependencies in node.users. This PR fixes this.
Test Plan:
Added a new test.
Also see the torchrec tests:
Step 1:
buck run mode/dev-nosan //aimp/experimental/pt2:pt2_export -- --model-entity-id 934687114 --output /tmp/934687114.zip --use-torchrec-eager-mp --use-manifold
Step 2:
buck run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true aimp/cli:cli -- --platform=aps --template=disagg_gpu_aps_pt2 --pt2 --model-entity-id=934687114 non-request-only-tagging torchrec-shard-and-quantize gpu-disagg-split assign-device materialize-weights script-and-save
Differential Revision: D59132214
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129716
Approved by: https://github.com/angelayi
Threads inside the thread pools are not named, so they inherit the main process name or the name of the first thread. In our case if we set `pt_main_thread` as the thread name when a thread does `import torch`, this name will be inherited by all the threads in the created pools.
This PR names the threads in the pools I was able to find. There are other pools created, like OpenMP ones and we need to follow-up on those.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130270
Approved by: https://github.com/d4l3k, https://github.com/albanD
Summary: I actually don't grok why this pattern works; I guess pytest expects a different import syntax for these relative imports?? But this pattern is used in many other tests here (notably `test_aot_inductor.py`), so it must be right ;)
Test Plan:
Ran both ways:
* `python test/inductor/test_memory_planning.py`
* `pytest test/inductor/test_memory_planning.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130275
Approved by: https://github.com/zou3519
----
- We now record on CacheEntry what the compile id that populated it was, so now we can say why a specific frame was rejected
- Add structured log for recompiles under name artifact "recompile_reasons". As it stands, it's not terribly structured, but this was the easiest thing I could do to start
- Slightly reformat multi-reason printing; since we only report one guard failure seems better to have it as a single line
Example output:
```
V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles] Recompiling function f in /data/users/ezyang/a/pytorch/b.py:3
V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles] triggered by the following guard failure(s):
V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles] - 0/0: tensor 'L['x']' size mismatch at index 0. expected 4, actual 5
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130043
Approved by: https://github.com/anijain2305
Previously, subgraph input names were whatever the input proxies were,
which were confusing. This PR changes those names to be
whatever the names of the arguments the functions being
speculate_subgraph'ed are. This is best-effort: if we can't figure it
out then we go back to the previous strategy.
Test Plan:
- existing expecttests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130255
Approved by: https://github.com/ydwu4
Auto slow test detection keeps marking and then unmarking these as slow, so permanently mark them as slow on Windows.
These tests take >500s on windows.
This is part of the reason why test_decomp keeps failing on windows (ex da66e50e6e)
The other part is something to do with reruns + thresholds that I am still investigating
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130260
Approved by: https://github.com/huydhn, https://github.com/malfet
Previously, jobs would log lines like this due to interpreting an int8 value as a signed char when streaming out.
"ProcessGroupNCCL created ncclComm_ 0x94960120 on CUDA device: ^@"
We need a better solution for avoiding this systematically, but at least
for now fix the spot we know about.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130184
Approved by: https://github.com/eeggl, https://github.com/Skylion007
Summary:
Previously we stored an edge id in numeric_debug_handle to support operator fusion and operator decomposition throughout the stack, but according to feedback from customers, people prefer the simpler per-node id; they are fine with not having the additional support for numerical debugging of inputs and are willing to hack around it to achieve this.
This PR changes the structure of numeric_debug_handle to store a unique_id for each node instead.
e.g.
graph:
```
node = op(input_node, weight_node)
```
Before:
```
node.meta[NUMERIC_DEBUG_HANDLE_KEY] = {input_node: id1, weight_node: id2, "output": id3}
```
After:
```
node.meta[NUMERIC_DEBUG_HANDLE_KEY] = id1
```
Test Plan:
python test/test_quantization.py -k TestGenerateNumericDebugHandle
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129811
Approved by: https://github.com/tarun292
This PR updates the public API for NJT construction `torch.nested.nested_tensor_from_jagged()` to accept values for min / max sequence length. It's useful to provide these ahead of time to avoid GPU -> CPU syncs from on-demand computation later on.
NB: The test changes are extensive because I reworked the existing `_validate_nt()` helper function used throughout our NJT construction tests to verify more (specifically: expected cached min / max seq len and contiguity).
API design question: should we additionally provide an option to compute these from `offsets` at construction time? I can think of three possible cases during construction:
1. Min / max seq len has already been obtained from *somewhere* (manual calculation, static values, etc.) and they should be used in the cache
2. Min / max seq len should be computed immediately at construction time for use in the cache (ideally, the caller wouldn't have to do this computation manually)
3. Min / max seq len are not needed at all (i.e. SDPA isn't ever called) and computation should be skipped
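For reference, a minimal usage sketch of the updated constructor (keyword names as in the current API; treat as illustrative):
```python
import torch

values = torch.randn(10, 8)            # 10 total tokens, 8 features
offsets = torch.tensor([0, 3, 7, 10])  # sequence lengths 3, 4, 3
nt = torch.nested.nested_tensor_from_jagged(
    values, offsets, min_seqlen=3, max_seqlen=4  # cached up front; avoids a later GPU -> CPU sync
)
```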
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130175
Approved by: https://github.com/davidberard98, https://github.com/soulitzer
**Summary**
In order to give users more information, I have added the device mesh for operations with DTensor inputs, as well as module parameter sharding and FQNs. These changes have only been placed in the operation tracing log. In the future, I plan to have just one logging function with an argument controlling how detailed a log the user wants, and will get rid of the module tracing log function. This information has also been added to the JSON dump and can be seen in the browser visual. I have also edited the test case file, as the module_depth dictionary has been replaced with module_helper_dict, and have edited the example output for the MLP operation tracing, which can be seen below:
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump
3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing
4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing
5. pytest test/distributed/_tensor/debug/test_comm_mode_features.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130072
Approved by: https://github.com/XilunWu
ghstack dependencies: #129994
Summary: We call `.get` in the elastic store barrier operation but we don't need the result. This switches it to use `.wait` instead which eliminates one network round trip as `get` internally does a wait first.
Test Plan:
CI + existing tests -- no behavior change
Differential Revision: D59396199
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130148
Approved by: https://github.com/kurman, https://github.com/wconstab
For optimizer checkpointing, tensors are created on CUDA when other backends are used. This is because, by default, a torch.device() constructed from a single device ordinal is treated as a CUDA device.
In _alloc_tensor, empty tensors are created using device = cast(torch.device, _get_device_module(device_type).current_device()). The above returns only the index, which creates the empty tensor on CUDA due to the default behavior. So, change it to use torch.device(device_type, device_module(device_type).current_device()) to get the device together with its index.
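A tiny illustration of the default behavior and of the fix's construction (the "xpu" backend here is just an example device type):
```python
import torch

torch.device(0)          # device(type='cuda', index=0): a bare ordinal defaults to CUDA
torch.device("xpu", 0)   # explicit device type plus index, as the fix constructs it
```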
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129110
Approved by: https://github.com/fegin
This PR makes it so that we don't try to serialize FunctionalTensorWrappers. FunctionalTensorWrappers don't pickle well because they have no underlying storage. This should be fixable at a later point, but I might not be the right author for implementing the serialization for it. If there's a way to avoid actually saving the FunctionalTensorWrappers themselves and just saving the ViewMetadata so we can replay it, that would also work.
To do this, we disable view_replay_input_mutations when using AOTAutogradCache, and then only keep the functional tensor in the ViewAndMutationMeta if we need it for view_replay_input_mutations (i.e. the cache is off).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128335
Approved by: https://github.com/bdhirsh
In https://www.internalfb.com/intern/sevmanager/view/s/429861/, a downstream consuming buffer `buf486_buf526` had two read dependencies, `buf373` and `buf394`, both of which were at separate indices of the upstream foreach op. `buf486_buf526` was fused into `buf373`; in the usual fused case this is completely fine as long as all dependencies are met in the upstream fused buffer. However, in the foreach case, and in this case specifically, foreach ops can be partitioned when there are many arguments in order to stay under CUDA driver argument limits. As a result, this large foreach op was split into two: the latter split had `buf394` in its node schedule for allocation, while the earlier split did not, even though `buf486_buf526` uses `buf394`. Consequently we would hit the unbound local error.
@eellison provided this repro to help debug the issue (https://www.internalfb.com/phabricator/paste/view/P1453035092)
To fix this, we no longer return a valid producer subnode if there are multiple producer subnodes for a downstream consuming op. In short we should not fuse if there are dependencies on multiple foreach subkernels because 1) their execution order is non-deterministic and 2) (this issue) we may not properly handle dependencies in the presence of foreach partitioning.
Co-authored-by: David Berard <dberard@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130046
Approved by: https://github.com/eellison
This PR is needed to resolve usability issues with PyTorch ROCm nightly wheels on non-gfx90a/gfx94x architectures as a result of https://github.com/pytorch/pytorch/pull/127944.
Addresses https://github.com/pytorch/pytorch/issues/119081#issuecomment-2166504992
### With this PR's changes, I get the following on a gfx908 (unsupported by hipblasLT) architecture:
_Using setter function:_
```
>>> torch.backends.cuda.preferred_blas_library(backend="cublaslt")
[W617 19:58:58.286088851 Context.cpp:280] Warning: torch.backends.cuda.preferred_blas_library is an experimental feature. If you see any error or unexpected behavior when this flag is set please file an issue on GitHub. (function operator())
[W617 19:59:02.125161985 Context.cpp:291] Warning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (function operator())
<_BlasBackend.Cublas: 0>
```
_Using `TORCH_BLAS_PREFER_HIPBLASLT` env var:_
```
root@9d47bf40d4d4:/tmp/pytorch# TORCH_BLAS_PREFER_CUBLASLT=1 python
>>> import torch
>>> torch.backends.cuda.preferred_blas_library()
[W619 06:14:11.627715807 Context.cpp:274] Warning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (function operator())
<_BlasBackend.Cublas: 0>
```
### and the following on a gfx90a (supported by hipblasLT) architecture:
_Using setter function:_
```
>>> import torch
>>> torch.backends.cuda.preferred_blas_library()
<_BlasBackend.Cublaslt: 1>
>>> torch.backends.cuda.preferred_blas_library(backend="cublas")
<_BlasBackend.Cublas: 0>
>>> torch.backends.cuda.preferred_blas_library(backend="cublaslt")
[W620 18:38:29.404265518 Context.cpp:293] Warning: torch.backends.cuda.preferred_blas_library is an experimental feature. If you see any error or unexpected behavior when this flag is set please file an issue on GitHub. (function operator())
<_BlasBackend.Cublaslt: 1>
```
_Using `TORCH_BLAS_PREFER_HIPBLASLT` env var:_
```
root@9d47bf40d4d4:/tmp/pytorch# TORCH_BLAS_PREFER_HIPBLASLT=1 python
>>> import torch
>>> torch.backends.cuda.preferred_blas_library()
<_BlasBackend.Cublaslt: 1>
```
(Same result for _Using `TORCH_BLAS_PREFER_CUBLASLT` env var:_)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128753
Approved by: https://github.com/malfet
Construct frame localsplus in 3.12+ using our own simplified way rather than copypasting from CPython.
This is necessary for 3.13 since we can no longer generate frame `f_locals` before executing the interpreter frame.
We also enable this for 3.12 since the `f_locals` construction between 3.12 and 3.13 is the same, so we can test for correctness with 3.12.
This is also one of the first steps to completing https://github.com/pytorch/pytorch/issues/93753 - we will implement simplified f_locals generation of previous Python versions in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129185
Approved by: https://github.com/jansel
**Summary**
Currently, users have 2 options to view the tracing data. The first is through the console, where colored text is used to help users read the information. The second is to log the information to a text file, which is useful when the log is too long to fit in the console. However, depending on the model complexity, these logs can run to thousands of lines, making it difficult for the user to find specific information. In order to fix this, I have added the functionality to convert the log into a JSON file, which is used to create a tree view in a browser, allowing the user to collapse parts of the log that are not useful to them. I have given the user the option to pass their own file path, with a default used in the event that none is provided. The expected output of the beginning of the JSON file and the browser view for the MLP model are shown below:
<img width="542" alt="Screenshot 2024-07-02 at 3 40 41 PM" src="https://github.com/pytorch/pytorch/assets/50644008/b9570540-e1d2-4777-b643-db4801b60ed8">
<img width="777" alt="Screenshot 2024-07-02 at 3 41 43 PM" src="https://github.com/pytorch/pytorch/assets/50644008/9296e255-c3ae-48a4-8be7-4273f69ee178">
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129994
Approved by: https://github.com/XilunWu
Summary:
use &= instead of |= since |= ignores incorrect scale/zp
change scale to use float comparison, instead of int comparison
Issue warning instead of error for backward compatibility: ex: P1204628034
Test Plan: see warning in: P1204628034
Reviewed By: jerryzh168
Differential Revision: D55699212
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123769
Approved by: https://github.com/jerryzh168
This PR:
* Sets a random seed before generating each sample for an OpInfo test. It does this by intercepting the sample input iterator via `TrackedInputIter`, optionally setting the seed to a test name specific seed before each iterator call (default is to set the seed).
* Some quick and dirty benchmarking shows (hopefully) negligible overhead from setting the random seed before each sample input generation. For a trivial (single assert) test that uses `@ops`:
* Uncovered a bunch of test issues:
  * Test breakdown (>100 total):
    * A lot of tolerance issues (tweaked tolerance values to fix)
    * 1 broken OpInfo (`sample_inputs_masked_fill` was generating a sample of the wrong dtype)
    * 3 actually broken semantics (for masked tensor; added xfails)
    * 4 Jacobian mismatches (added xfails)
    * 2 nan results (skip for now, need fixing)
    * 3 results too far from reference result (add xfails)
* Skips MPS tests for now (there are so many failures!). Those will default to the old behavior.
**before (no seed setting):**
```
real 0m21.306s
user 0m19.053s
sys 0m5.192s
```
**after (with seed setting):**
```
real 0m21.905s
user 0m19.578s
sys 0m5.390s
```
* Utilizing the above for reproducible sample input generation, adds support for restricting the iterator to a single sample input. This is done via an env var `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX` and its usage is included in the repro command.
```
======================================================================
ERROR: test_bar_add_cuda_uint8 (__main__.TestFooCUDA.test_bar_add_cuda_uint8)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 971, in test_wrapper
return test(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/jbschlosser/branches/testing_updates/test/test_ops.py", line 2671, in test_bar
self.assertFalse(True)
AssertionError: True is not false
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper
method(*args, **kwargs)
File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper
method(*args, **kwargs)
File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 419, in instantiated_test
result = test(self, **param_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 1426, in wrapper
fn(*args, **kwargs)
File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 982, in test_wrapper
raise new_e from e
Exception: Caused by sample input at index 3: SampleInput(input=Tensor[size=(10, 5), device="cuda:0", dtype=torch.uint8], args=TensorList[Tensor[size=(), device="cuda:0", dtype=torch.uint8]], kwargs={}, broadcasts_input=False, name='')
To execute this test, run the following from the base repo dir:
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=3 python test/test_ops.py -k TestFooCUDA.test_bar_add_cuda_uint8
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.037s
FAILED (errors=1)
```
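A rough sketch of the per-sample reseeding idea described above (the real mechanism is `TrackedInputIter`; this is illustrative only):
```python
import itertools
import torch

def reseeded_samples(sample_inputs, seed):
    """Yield samples, resetting torch's RNG to a test-specific seed before generating each one."""
    it = iter(sample_inputs)
    for _ in itertools.count():
        torch.manual_seed(seed)  # same seed before every sample -> reproducible by index
        try:
            yield next(it)
        except StopIteration:
            return
```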
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128238
Approved by: https://github.com/janeyx99, https://github.com/justinchuby
If we have dynamic shapes, the heuristic in mixed_mm will cause a crash, because it cannot compare m, k and n to integer values. This PR makes it so that the heuristic only runs if we have static shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130081
Approved by: https://github.com/Chillee
Summary:
We have the cache to guarantee the `sym` is codegen only once, see the following code
```
def ensure_size_computed(self, sym: sympy.Symbol):
    if isinstance(sym, sympy.Symbol) and symbol_is_type(sym, SymT.PRECOMPUTED_SIZE):
        if sym in self.computed_sizes:
            return
        self.computed_sizes.add(sym)
        expr = V.graph.sizevars.inv_precomputed_replacements[sym]
        self.writeline(
            f"{self.declare}{sym} = {self.expr_printer(expr)}{self.ending}"
        )
```
However, we didn't consider the case where the same `sym` needs to be codegened in both branches of a condition (the true branch and the false branch), which caused the issue of `undefined symbols`: P1441378833
To fix the issue, we use a stack to capture the state before doing the condition codegen and restore the state after doing the codegen.
Test Plan:
TORCH_LOGS="+inductor" buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100 -c fbcode.enable_gpu_sections=true --config 'cxx.extra_cxxflags=-g1' -c fbcode.platform010_cuda_version=12 //scripts/hhh:repro_cond_torch_compile
PYTORCH_TEST_FBCODE=1 TORCH_COMPILE_DEBUG=1 buck2 run mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true //caffe2/test/inductor:control_flow -- -r test_cond_control_flow_with_precomputed_size
Differential Revision: D58973730
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129492
Approved by: https://github.com/aakhundov
This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0) # 2*s0
w = z.repeat(y.shape[0]) # 2*s0*s1
_w = w.shape[0]
# something with _w ...
# turns into ->
s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```
Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)
# turns into
torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128599
Approved by: https://github.com/ezyang
**Summary**
Support more than 1 Local Buffer in an outer loop fused node, and also the case where multiple global buffers share the same Local Buffer.
**TestPlan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_two_local_buffers_in_outer_loop_fusion
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_share_local_buffers_in_outer_loop_fusion
```
**Next Step**
- [✓] Support more than one Local Buffer/Global Buffer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129121
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #126967
Compiling the `create_block_mask` function allows us to "materialize" extremely large masks. This would have been a 1 *trillion* element tensor if fully materialized.
```
print(do_bench(lambda: create_block_mask(causal_mask, 1, 1, 2**20, 2**20, _compiled=True)))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130106
Approved by: https://github.com/yanboliang
ghstack dependencies: #130160
Summary:
Multiple threads can be accessing the alloc_trace std::vector concurrently, which will result in SIGSEGVs when objects are double freed, accessed after free, or inserted at the same time from two threads.
We need to lock when inserting, accessing or removing TraceEntry in alloc_trace.
Test Plan:
This is a rare crash, which was exposed when we introduced recordAnnotations, which saves record_function annotations into the snapshot files. Saving a lot of annotations can trigger this bug. Here are a few jobs that crashed before and that this diff fixes.
Differential Revision: D59380507
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130180
Approved by: https://github.com/eqy, https://github.com/kit1980
There is one huge problem this fixes: today, sympify(symint)
produces a float(!!) because Sympy attempts to see if you can
coerce the symint to float in sympify and of course this works on
SymInt.
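A small stand-alone illustration of that coercion (using a stand-in class rather than torch.SymInt; this is sympy's generic non-strict sympify fallback, not PyTorch code):
```python
import sympy

class FakeSymInt:
    """Stand-in for an object that, like SymInt, is convertible to float and int."""
    def __init__(self, value):
        self.value = value
    def __float__(self):
        return float(self.value)
    def __int__(self):
        return int(self.value)

expr = sympy.sympify(FakeSymInt(3))
print(type(expr))  # a sympy Float, not an Integer: the float() coercion path wins
```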
However, this also has another nontrivial effect: anywhere in Inductor
where sympy expressions are passed around, it is also valid to pass
around a SymInt now. I'm ambivalent about this: it's currently a
mistake to be passing around a SymInt when a sympy expression is
expected. But maybe this is fine?
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130166
Approved by: https://github.com/yf225
Summary:
## Context
TL;DR: aot_export failed for SDPA memory efficient backend when using `inference_mode`
The CMF AOTI lowering started to fail on the trunk. We have the script (https://fburl.com/code/kfk64i5s) to reproduce the issue quickly (log: P1469307638). By bisecting the stack, we found the issue starting from the D58701607
## Root Cause
In the `inference_mode()`,
the `aten::scaled_dot_product_attention` was not decomposed before the `functionalization` and the op it-self was an out-place op, so the `functionalization` doesn't make change and then was decomposed into `masked_fill_.`, then decomposed to the `copy_`
So it's `aten::sdpa` --- (functionalization) ---> `aten::sdpa` --- (decompose) ---> `masked_fill_` --- (decompose) ---> `copy_` ---> failure
In the `torch.no_grad()`,
`aten::sdpa` was decomposed before `functionalization`, so the story is
`aten::sdpa` --- (decompose) ---> `masked_fill_` --- (functionalization) ---> `masked_fill` --- (decompose) ---> `out-place ops` ---> good
## How to fix
Long-term:
The issue was tracked in the ticket (https://github.com/pytorch/pytorch/issues/129418). The long-term fix could be we do one more round of `functionalization` after the `decompose`, like
`aten::sdpa` --- (functionalization) ---> `aten::sdpa` --- (decompose) ---> `masked_fill_` --- (functionalization) ---> `masked_fill` ---> good
Short-term:
It would be a big change I guess. To unblock the production use-case, I marked the `aten::sdpa` should be decomposed in this diff
Test Plan:
local repro works now
buck run mode/opt scripts/sijiac/prototypes:sdpa_aoti
Differential Revision: D59385876
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130164
Approved by: https://github.com/zou3519
The real root cause of the issue is that the current stream on a given CUDA device may be the legacy default stream, which doesn't seem to have a device associated with it. If the current CUDA device as reported by `cudaGetDevice` doesn't match the intended legacy default stream's device (this happens if a user is running distributed code without e.g. `torch.cuda.set_device(mylocalrank)`), then the stream synchronize will not have the intended effect. The previous stream sync code here correctly inserted a `DeviceGuard` to ensure that this legacy-default-stream sync with a mismatched current device didn't happen, but the check is elided here. The simplest fix is to just use the `CUDAStream` wrapper's `synchronize()` call, which already correctly uses a `DeviceGuard` internally:
a21d4363d2/c10/cuda/CUDAStream.h (L132)
OUTDATED below:
The current behavior of `barrier`'s `synchronizeInternal` seems to be a bit counterintuitive, as it is synchronizing on a device's current `CUDAStream` rather than the one used for the actual `allreduce` (the `ncclStream`). In practice this results in a script like the following:
```
import logging
import os
import time
import torch
import torch.distributed as dist
def main():
    logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
    backend = 'nccl'
    group = torch.distributed.init_process_group(backend=backend)
    rank = torch.distributed.get_rank(group=group)
    for i in range(4):
        time.sleep(rank)
        logging.info(f"Rank {rank}: enter barrier {i}")
        dist.barrier()
        logging.info(f"Rank {rank}: exit barrier {i}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
appearing to show that ranks can exit barrier(s) before other ranks have entered. Note that the device-side ordering should still be correct in this case, but the host is free to run ahead.
The issue can be worked-around by adding a `torch.cuda.synchronize(rank)` after the `barrier`, but this seems to be against the spirit of the stream synchronization which deliberately tried to avoid a device synchronization.
This PR does a sync on the `allreduce`'s stream so that a device synchronization is not needed to align the host's output with the device.
CC @wujingyue @Aidyn-A @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129908
Approved by: https://github.com/kwen2501
Removes extraneous .a, .so, and .py files from the split build. From here we can also clean up the builder script which produces the binary to do this. That PR is https://github.com/pytorch/builder/pull/1912
Verification:
The built wheel with BUILD_LIBTORCH_WHL=1 has the following files only (with .a, .so, and .py extensions)
```
sahanp@devgpu086 ~/p/dist (viable/strict)> pwd (pytorch-3.10)
/home/sahanp/pytorch/dist
sahanp@devgpu086 ~/p/dist (viable/strict)> find . -type f \( -name "*.py" -o -name "*.a" -o -name "*.so" \) (pytorch-3.10)
./torch/__init__.py
./torch/lib/libbackend_with_compiler.so
./torch/lib/libc10.so
./torch/lib/libjitbackend_test.so
./torch/lib/libtorch.so
./torch/lib/libtorch_cpu.so
./torch/lib/libtorch_global_deps.so
./torch/lib/libtorchbind_test.so
sahanp@devgpu086 ~/p/dist (viable/strict)>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130053
Approved by: https://github.com/atalman
Summary: The explain function does a conversion dry run to provide feedback on which operators are not supported / fail the conversion to the users.
Test Plan: * `pytest test/export/test_converter.py`
Differential Revision: D59251934
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129968
Approved by: https://github.com/angelayi
To avoid an outage on HUD, I plan to migrate perf stats to dynamoDB as follows:
1. Upload perf stats to both Rockset and dynamoDB
2. Copy all the existing content from Rockset to dynamoDB
3. Create new Rockset tables to map to dynamoDB
4. Switch HUD to use the new Rockset tables (temporarily)
5. Delete the existing tables
This depends on https://github.com/pytorch-labs/pytorch-gha-infra/pull/422
### Testing
```
python3 -m tools.stats.upload_dynamo_perf_stats --workflow-run-id 9770217910 --workflow-run-attempt 1 --repo "pytorch/pytorch" --head-branch "gh/shunting314/162/head" --rockset-collection torch_dynamo_perf_stats --rockset-workspace inductor --dynamodb-table torchci-dynamo-perf-stats --match-filename "^inductor_"
...
Writing 1607 documents to DynamoDB torchci-dynamo-perf-stats
```
And confirm the same number of documents is on the table

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129544
Approved by: https://github.com/clee2000
Summary:
Title. This way, both FXGraphCache and AOTAutogradCache use the same torch_key, and we don't need to only hash specific files.
There's an argument to be made to only hash *.py and *.cpp files. Maybe we can fix the glob to do that.
We use a buck_filegroup because otherwise $SRCs gets too large. By using `$(location :torch_sources)`, we make the genrule implicitly depend on all files globbed by torch_sources.
Test Plan:
Unit tests still pass on OSS
For torch_key:
```
buck2 build caffe2:src_hash.txt -v 2 --show-output
```
See the output, then make any change to any torch file. See that the hash changes.
Reviewed By: oulgen
Differential Revision: D58875785
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129250
Approved by: https://github.com/oulgen
This model's accuracy test recently regressed. I had a quite smooth debugging process figuring out the cause, so I'd like to write it down just in case it can be helpful.
Clicking the model name beit_base_patch16_224 on the dashboard, we are able to see the pass status of the model in e.g. the past month. For this model, we can see that it starts to fail on June 08:
<img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e">
What's nice is the dashboard shows the nightly commits for each run.
Running
```
git log --oneline a448b3ae9537c0ae233fb9199a4a221fdffbb..0e6c204642a571d5a7cd60be0caeb9b50faca030 torch/_inductor/
```
Gives us the list of Inductor PRs between the good and bad commit: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df
Roughly looking thru the PRs, I feel
```
ffc202a1b91 Added remove_noop_ops to joint_graph_passes (#124451)
```
can change numerics, so I disabled it locally with this one-line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . And then the accuracy test passes. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224 )
Horace's PR (https://github.com/pytorch/pytorch/pull/124451) itself is valid. It removes no-op ops in the joint graph. I think the graph may get changed, causing the partitioner to make different recomputation decisions, which can change numerics.
Since this is not a real issue, I'll raise the tolerance to make it pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130005
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #129996, #129941
This PR batches fixes for a few accuracy failures during training by raising tolerances. I do that only for models that I think fail not due to a real issue.
## sebotnet33ts_256
The accuracy test for this model started to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).
I cannot repro locally, but from the dashboard log:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
raising the tolerance should fix it.
## DebertaForQuestionAnswering
This model fails the accuracy test on the dashboard only in max-autotune mode. I cannot repro it locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```
From the error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```
0.02 tolerance should suppress this error.
## gluon_inception_v3
This model fails on the dashboard in max-autotune mode. I cannot repro it locally with the command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```
From the error message on the dashboard:
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
raising tolerance should suppress this error.
## mobilenetv3_large_100
Fails in max-autotune (MA) mode. I cannot repro it locally with the command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only
```
The error message on the dashboard is
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in torch._dynamo.utils.same.
## yolov3
Fails on the dashboard with error
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
Fix it by using a larger multiplier for smaller tensors and raising the tolerance.
## timm_efficientdet
Fails on the dashboard with error
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I cannot repro it locally with the command
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet --training
```
Raising the tolerance should fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
I'm debugging the accuracy failure for training vision_maskrcnn.
Unfortunately I could not get it to run locally (I've checked that the pinned commits for torchbenchmark/torchvision are correct, and reinstalled torchbenchmark for mask_rcnn). I get this error:
```
eager run fail: AssertionError: targets should not be none when in training mode
```
(Command: time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --training --only vision_maskrcnn )
But looking at the log from the dashboard:
```
E0623 19:17:59.085000 140114670171328 torch/_dynamo/utils.py:1468] RMSE (res-fp64): nan, (ref-fp64): nan and shape=torch.Size([1024, 256, 1, 1]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
We can see that both the reference number and the PT2 number are NaN. I changed torch._dynamo.utils.same to return true if both RMSE values are NaN.
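A tiny hypothetical sketch of that NaN handling (not the real torch._dynamo.utils.same implementation, and the comparison formula here is illustrative only):
```python
import math

def rmse_matches(res_rmse: float, ref_rmse: float, multiplier: float = 3.0) -> bool:
    # If both RMSE values are NaN, treat the comparison as passing rather than failing.
    if math.isnan(res_rmse) and math.isnan(ref_rmse):
        return True
    return res_rmse <= multiplier * ref_rmse
```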
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129996
Approved by: https://github.com/jansel
Summary: This is to forward-fix D59140215 from a PyTorch open source contributor, T194074371. On the PyTorch side, we need to use isinstance instead of type when checking for nn.Module. This is the same way get_submodule is currently implemented.
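A small illustration of the difference (generic Python, not the actual code being fixed): an exact type() check rejects subclasses of nn.Module, while isinstance accepts them.
```python
import torch.nn as nn

class MyLinear(nn.Linear):
    pass

m = MyLinear(2, 2)
print(type(m) is nn.Module)      # False -- exact-type check misses subclasses
print(isinstance(m, nn.Module))  # True  -- subclasses count as nn.Module
```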
Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//dper3/dper3/core/tests:module_test`
Differential Revision: D59254638
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130075
Approved by: https://github.com/mikaylagawarecki
# Changes
* small fix in stage error message
* Move `format_pipeline_order` and `_validate_pipeline_order` out of `test_schedule.py` into `schedules.py`.
* Wrap the execution runtime in a try-except which on error will log the timestep and schedule plan before re-raising the exception.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129369
Approved by: https://github.com/wconstab
ghstack dependencies: #129368
188 new ATen operators/variants are added in the pin update, involving eager and torch.compile usage on HuggingFace, TIMM and TorchBench models. 16 new unit tests ported to enhance functionality coverage. Aligned source file directory structure with ATen native. Fixed corner case failures in aten::resize, aten::index_add and aten::index_put.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129353
Approved by: https://github.com/EikanWang
**Summary**
As the CommModeFeature example file grew, too many lines of code were repeated for setting up the models used. I created two functions, one to handle MLP and MLPStacked models and the other for transformer models. The output of the examples is unchanged.
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_distributed_sharding_display
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLPStacked_distributed_sharding_display
3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing
4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing
5. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing
6. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129613
Approved by: https://github.com/XilunWu
ghstack dependencies: #129602
**Summary**
Currently, comm_mode only allows users to differentiate between forward and backward passes at the operational level. I modified the code so that users can now see the collective counts for the passes at a module level. I decided to slightly change how the output is formatted, making it easier to differentiate between a collective count and an operation. I have designed the operational trace table function so that, in the future, a user can use command-line arguments to determine the level of information they want to display instead of having two similar functions. Finally, I have updated the new output and test cases for the comm_mode example and test files. The expected output for the first 3 examples is shown below:
<img width="320" alt="Screenshot 2024-06-26 at 2 30 25 PM" src="https://github.com/pytorch/pytorch/assets/50644008/b8e88075-a07f-4e84-b728-a08959df3661">
<img width="497" alt="Screenshot 2024-06-26 at 2 29 15 PM" src="https://github.com/pytorch/pytorch/assets/50644008/5ef4bea7-1355-4089-bfb0-c7e3f588ac77">
<img width="615" alt="Screenshot 2024-06-26 at 2 31 05 PM" src="https://github.com/pytorch/pytorch/assets/50644008/feacae51-76f7-403b-b6cd-dd15e981770e">
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing
3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing
4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing
5. pytest test/distributed/_tensor/debug/test_comm_mode_features.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129602
Approved by: https://github.com/XilunWu, https://github.com/wz337
This moves a bunch of runtime inspection of the `output_info` for alias handling into the construction of fixed output handlers that are created during compilation and captured by the runtime wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128188
Approved by: https://github.com/bdhirsh
Fixes #129389
If a user registers a device-specific implementation for an operator that accepts no Tensors, then we require the operator to have a `device: torch.device` argument.
We switch on the device argument to select the correct backend to dispatch to.
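A hedged sketch of this pattern (the op name and bodies are hypothetical, not from the PR): because the operator takes no Tensors, the explicit `device` argument is what selects the per-device implementation.
```python
import torch

# Hypothetical op: no Tensor inputs, so a `device: torch.device` argument is required.
@torch.library.custom_op("mylib::make_range", mutates_args=(), device_types="cpu")
def make_range(n: int, device: torch.device) -> torch.Tensor:
    # CPU implementation
    return torch.arange(n, device="cpu")

@make_range.register_kernel("cuda")
def _(n: int, device: torch.device) -> torch.Tensor:
    # CUDA implementation, selected by switching on the `device` argument
    return torch.arange(n, device=device)
```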
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129978
Approved by: https://github.com/zou3519
Summary:
Previously we stored an edge id in numeric_debug_handle to support operator fusion and operator decomposition throughout the stack,
but according to feedback from customers, people prefer the simpler per-node id, and they are fine with not having the additional
support for numerical debugging of inputs and are willing to hack around to achieve this.
This PR changes the structure of numeric_debug_handle to store unique_id for each node instead.
e.g.
graph:
```
node = op(input_node, weight_node)
```
Before:
```
node.meta[NUMERIC_DEBUG_HANDLE_KEY] = {input_node: id1, weight_node: id2, "output": id3}
```
After:
```
node.meta[NUMERIC_DEBUG_HANDLE_KEY] = id1
```
Test Plan:
python test/test_quantization.py -k TestGenerateNumericDebugHandle
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129811
Approved by: https://github.com/tarun292
This PR modifies `_embedding_bag_backward` item inside _native_functions.yaml_, so that it
dispatches to CPU and CUDA directly, instead of `CompositeImplicitAutograd`.
*Context:* PyTorch operations that have the `CompositeImplicitAutograd` dispatch do not
allow third party backends (e.g. XLA) to modify its implementation, since this dispatch
key has higher priority. When calling `_embedding_bag_backward` operation using XLA, a
dispatch error will be thrown, since PyTorch/XLA doesn't support sparse tensors.
*Problem:* `_embedding_bag_backward` has a `sparse` parameter that controls whether the
operation should return a sparse or dense tensor. However, at the moment, PyTorch/XLA does
not support sparse tensors. In order to fallback that execution to dense, i.e. change the
flag at runtime, we need to be able to modify its implementation.
*Solution:* we have changed the dispatch of `_embedding_bag_backward` to CPU and CUDA,
which allowed us to introduce our own kernel for it.
Additionally, this PR refactored the representation of its mode from constant integers
into an enum class. It also introduces two additional operators: `int == EmbeddingBagMode`
and `int != EmbeddingBagMode`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129691
Approved by: https://github.com/lezcano
More context [here](https://github.com/pytorch/pytorch/issues/129682#issuecomment-2195463838), but this change was enough to get this AOTI + float8 repro running for me (below).
Previously, it would fail an assertion [here](https://github.com/pytorch/pytorch/blob/main/torch/_meta_registrations.py#L5387) at inductor lowering time. It looks like during lowering, we were supposed to pass `param.transpose(1, 0)` as the second argument to the scaled_mm kernel. But in the inductor IR, this object is a `ReinterpretView` with `get_name()` equal to one of the param constants, so we would end up passing the constant directly into the kernel, instead of performing the view first.
I'm not totally sure if this is the right place to make the change, so interested in any thoughts from inductor folks (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @eellison )
```
import torch
from torch.export import export
from torch.export._trace import _export
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD 3-Clause license found in the
# LICENSE file in the root directory of this source tree.
import copy
import io
import random
import unittest
import pytest
import torch
import torch.nn as nn
import torch.nn.functional as F
from float8_experimental.float8_dynamic_linear import Float8DynamicLinear
from float8_experimental.float8_linear_utils import swap_linear_with_float8_linear
from float8_experimental.float8_tensor import Float8Tensor
from float8_experimental.float8_utils import compute_error
random.seed(0)
torch.manual_seed(0)
is_H100 = torch.cuda.is_available() and torch.cuda.get_device_capability() >= (9, 0)
import torch.nn.utils.parametrize as parametrize
# NOTE: we should upstream this directly into export and make it more automatic!
class UnwrapTensorSubclass(torch.nn.Module):
    def forward(self, *tensors):
        todo = list(tensors)
        for tp, meta, inner_tensors in reversed(self.rebuild_stack):
            nb_tensor = len(inner_tensors)
            inner_tensors = {a: b for a, b in zip(inner_tensors, todo[-nb_tensor:])}
            todo = todo[nb_tensor:]
            rebuilt = tp.__tensor_unflatten__(inner_tensors, meta, None, None)
            todo.append(rebuilt)
        assert len(todo) == 1
        return todo[0]

    def right_inverse(self, tensor):
        assert type(tensor) is not torch.Tensor
        rebuild_stack = []
        plain_tensors = []
        todo = [tensor]
        while todo:
            obj = todo.pop()
            inner_tensors, metadata = obj.__tensor_flatten__()
            rebuild_stack.append((type(obj), metadata, inner_tensors))
            for attr_name in inner_tensors:
                val = getattr(obj, attr_name)
                if type(val) is torch.Tensor:
                    plain_tensors.append(val)
                else:
                    assert isinstance(val, torch.Tensor)
                    todo.append(val)
        self.rebuild_stack = rebuild_stack
        return plain_tensors

def unwrap_tensor_subclass(model, filter_fn=None):
    for name, child in model.named_children():
        if (
            isinstance(child, Float8DynamicLinear) and
            hasattr(child, "weight") and
            type(child.weight) is not torch.Tensor and
            isinstance(child.weight, torch.Tensor)
        ):
            parametrize.register_parametrization(child, "weight", UnwrapTensorSubclass())
        unwrap_tensor_subclass(child)
    return model

class FeedForward(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.w1 = nn.Linear(4096, 14336, bias=False)
        self.w3 = nn.Linear(4096, 14336, bias=False)
        self.w2 = nn.Linear(14336, 4096, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

    def reset_parameters(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                m.reset_parameters()

export_model = FeedForward().to("cuda")
swap_linear_with_float8_linear(
    export_model,
    Float8DynamicLinear,
    from_float_kwargs={"pre_quantize_weight": True},
)
export_model = unwrap_tensor_subclass(export_model)
batch_size = 4
num_tokens = 1024
embedding_dim = 4096
input_tensor = torch.randn(
    batch_size, num_tokens, embedding_dim, device="cuda", dtype=torch.float32
)
example_args = (input_tensor,)
# NOTE: this breaks unless we use strict=False, pre_dispatch=False!
exported_program: torch.export.ExportedProgram = _export(
    export_model,
    example_args,
    strict=False,
    pre_dispatch=False,
)
with torch.no_grad():
    so_path = torch._inductor.aot_compile(exported_program.module(), example_args)
print(so_path)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129688
Approved by: https://github.com/eellison
Ops in torch, torch.functional, and torch.nn.functional are cache safe by default (at least, based on my cursory audit of the ops). This fixes a few tests that use these ops with the cache.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128329
Approved by: https://github.com/bdhirsh
The default value of `rot90()` in the schema registry is `[0,1]` because we split the function schema by `", "`. There should be no space after `,` in `[0,1]`.
5c9d5272e4/aten/src/ATen/native/native_functions.yaml (L6120-L6126)
Then the default value is formatted as `(0,1)` in the `pyi` files. This PR manually adds an extra whitespace when re-rendering the default value to a string.
```python
", ".join(string.split(","))
```
```python
# before
def rot90(input: Tensor, k: _int = 1, dims: _size = (0,1)) -> Tensor: ...
# after
def rot90(input: Tensor, k: _int = 1, dims: _size = (0, 1)) -> Tensor: ...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129884
Approved by: https://github.com/ezyang
Summary: The test is from D59181111, but I couldn't figure out a way to make it pass on FBCODE because loading PyTorch C++ extension requires Ninja which is not going to work with BUCK
Test Plan: `buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test:transformers`
Differential Revision: D59304327
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129997
Approved by: https://github.com/drisspg
Fixes #128510.
https://github.com/pytorch/pytorch/pull/124451 makes LayoutLMForSequenceClassification hit SDPA pattern 1 and then encounter the accuracy issue. The issue only happens with single-threaded BF16 inference. This PR increases the model tolerance to make the check pass. Note that even the math-version SDPA could have the issue because of some small implementation diff.
The test log:
Single thread
```
correct_result: SequenceClassifierOutput(loss=tensor(0.5998), logits=tensor([[0.3301, 0.1338]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
new_result: SequenceClassifierOutput(loss=tensor(0.6016), logits=tensor([[0.3281, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
E0627 01:09:16.762789 140281313759104 torch/_dynamo/utils.py:1476] RMSE (res-fp64): 0.00151, (ref-fp64): 0.00046 and shape=torch.Size([1, 2]). res.dtype: torch.bfloat16, multiplier: 3.000000, tol: 0.001000
E0627 01:09:16.762972 140281313759104 torch/_dynamo/utils.py:1390] Accuracy failed for key name logits
fail_accuracy
```
Multiple threads
```
correct_result: SequenceClassifierOutput(loss=tensor(0.6007), logits=tensor([[0.3301, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
new_result: SequenceClassifierOutput(loss=tensor(0.6016), logits=tensor([[0.3281, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
pass
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129728
Approved by: https://github.com/jgong5, https://github.com/jansel
This does a round trip request on socket connect -- this allows for detecting connection resets etc and retrying before the non-retryable application requests are sent.
This adds support for PING to both the libuv and legacy backend.
Example error:
```
[trainer85612|12]:W0701 13:41:43.421574 4776 TCPStore.cpp:182] [c10d] recvValue failed on SocketImpl(fd=24, ...): Connection reset by peer
[trainer85612|12]:Exception raised from recvBytes at /mnt/code/pytorch/torch/csrc/distributed/c10d/Utils.hpp:669 (most recent call first):
...
[trainer85612|12]:#9 c10d::TCPStore::incrementValueBy(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84809637
[trainer85612|12]:#10 c10d::TCPStore::waitForWorkers() from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84812868
[trainer85612|12]:#11 c10d::TCPStore::TCPStore(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10d::TCPStoreOptions const&) from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84814775
```
Test plan:
```
python test/distributed/test_store.py -v
```
```
tristanr@devvm4382 ~/pytorch (d4l3k/tcpstore_ping)> python ~/pt_tests/tcpstore_large_test.py
starting pool
started 90000
started 30000
started 70000
started 20000
started 80000
started 60000
started 0
[W702 16:16:25.301681870 TCPStore.cpp:343] [c10d] Starting store with 100000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it.
init 20000
set 20000
init 80000
set 80000
init 70000
set 70000
init 60000
set 60000
init 30000
set 30000
init 90000
set 90000
started 40000
init 40000
set 40000
started 50000
init 50000
set 50000
started 10000
init 10000
set 10000
init 0
set 0
run finished 617.2992351055145
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129985
Approved by: https://github.com/rsdcastro, https://github.com/kurman
Rerun the failing test singly with the env var set. If it succeeds, start a new process without the cpp stack traces env var.
We don't want to waste time generating these if we don't have to.
They can also show up in assertion errors, which may cause unexpected failures if a test wants to check these.
Adds a new --rs (run single) option to be used the same way --scs and --sc are. It will only run the single test in the stepcurrent file.
https://hud.pytorch.org/pytorch/pytorch/pull/129004?sha=2c349d3557d399020bf1f6a8b7045e2e4957ba46 has some examples of logs
In the above:
* test_checkpoint_valid failed, then passed in another subprocess. The testing continued in a different new subprocess from the test right after it (test_checkpointing_without_reentrant_early_free)
* test_format_traceback_short failed consistently, but it continued to run because keep-going was set
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129004
Approved by: https://github.com/PaliC
Summary:
1. add one more model lib dep.
2. add an error message when torchscript fails to find a class in the python compilation unit.
Test Plan: CI
Reviewed By: jingsh
Differential Revision: D59243250
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129897
Approved by: https://github.com/jingsh
Previously each mutation was represented by a `MutationOutput` operation, which
was a new scheduler node that had to be scheduled immediately afterwards.
Now we have a single scheduler node, which produces multiple `MutationOutput`
buffers as its output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129325
Approved by: https://github.com/lezcano
ghstack dependencies: #128893
Currently a buffer represents both a tensor with physical storage and a
computation that produces the tensor as a result.
This PR attempts to split these into two different concepts in the scheduler.
This should allow us to have multiple outputs from a single operation.
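A rough conceptual sketch (names are hypothetical, not the Inductor scheduler classes) of the split: the computation becomes its own node that can own several output buffers.
```python
from dataclasses import dataclass, field

@dataclass
class Buffer:
    name: str                  # a tensor with physical storage

@dataclass
class Operation:
    name: str                  # the computation that produces results
    outputs: list = field(default_factory=list)  # one operation -> many Buffers

op = Operation("fused_kernel", outputs=[Buffer("buf0"), Buffer("buf1")])
print([b.name for b in op.outputs])  # ['buf0', 'buf1']
```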
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128893
Approved by: https://github.com/lezcano
I run into this a lot. I can imagine that it would look opaque to users,
so I made it more friendly.
Old error message: "ValueError: infer_schema(func): Return has unsupported type <class 'inspect._empty'>."
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129896
Approved by: https://github.com/yushangdi
The revert of #127199 seems to surface an additional failure on A100---small tolerance bump to account for this.
I did find what appears to be a race condition in the one of the kernels used in this workload but I'm not sure it's related here...
CC @nWEIdia
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129902
Approved by: https://github.com/ezyang
Summary:
The inputs to the grid function are variadic: it can be one, two, or three numbers. The current implementation captured them as a single tuple, for example "grid((16,))". The fix is to record them as a varying number of elements; in the previous example, it becomes "grid(16,)".
The PARAM et-replay code will be modified to reflect this change in a follow-up DIFF.
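A small hypothetical illustration (not the actual observer code) of the recording change, from a single tuple argument to varargs:
```python
def record_grid_old(grid):
    # the whole grid is captured as one tuple
    return f"grid({grid})"                  # record_grid_old((16,)) -> "grid((16,))"

def record_grid_new(*grid_args):
    # each grid dimension is captured as its own element
    return f"grid({','.join(str(g) for g in grid_args)},)"  # record_grid_new(16) -> "grid(16,)"
```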
Test Plan: buck2 test mode/dev-nosan caffe2/test:profiler -- -- test_execution_trace_with_pt2
Differential Revision: D59195933
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129832
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
# Error
```
File "/data/users/colinpeppler/pytorch/torch/_meta_registrations.py", line 704, in sym_constrain_range
constrain_range(size, min=min, max=max)
File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 898, in constrain_range
a.node.shape_env._constrain_range(a.node.expr, min, max)
File "/data/users/colinpeppler/pytorch/torch/fx/experimental/recording.py", line 245, in wrapper
return fn(*args, **kwargs)
File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 2813, in _constrain_range
assert isinstance(a, sympy.Symbol), f"constraining non-Symbols NYI, {a} is {type(a)}"
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AssertionError: constraining non-Symbols NYI, s1 + s2 is <class 'sympy.core.add.Add'>
```
# Context
I ran into the following scenario:
```
getitem = ...
sym_size_int = torch.ops.aten.sym_size.int(getitem, 0) # this is u0 = s0 + s1
_check_is_size = torch._check_is_size(sym_size_int)
# we fail at this guy
sym_constrain_range_default = torch.ops.aten.sym_constrain_range.default(sym_size_int, min = 4, max = 1234)
# runtime assertion
add = sym_size_int + sym_size_int_1
eq = add == sym_size_int
_assert_scalar_default = torch.ops.aten._assert_scalar(eq, "Runtime assertion failed for expression Eq(s0 + s1, u0) on node 'eq'")
```
everything but getitem was asserted into the FX graph by insert_deferred_runtime_asserts()
7e4329c258/torch/fx/passes/runtime_assert.py (L38-L52)
In the above scenario, we fail trying to constrain the range on `s0 + s1`, which is not a `sympy.Symbol`.
And why exactly are we constraining the range on `s0 + s1`? Because it's the replacement for `u0`.
# Approach
Whenever we try to constrain the range on the replacement of ~~an unbacked symint~~ a non-symbol, just ignore it.
In the scenario above, we'll be okay to ignore it because whenever there's a replacement on an unbacked symint, we will update its range. Hence, there is no need to constrain the range on `s0 + s1`. We can confirm this with `TORCH_LOGS="+dynamic"`.
```
torch/fx/experimental/symbolic_shapes.py:4737: _update_var_to_range u0 = VR[4, 198] (update)
torch/fx/experimental/symbolic_shapes.py:4856: set_replacement u0 = s1 + s2 (trivial_lhs) VR[4, 198]
```
600bf978ba/torch/fx/experimental/symbolic_shapes.py (L4759-L4764)
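A minimal hypothetical sketch (not the actual ShapeEnv code) of the guard described in the approach above: skip range constraining when the target is no longer a plain Symbol, e.g. because u0 was replaced by s0 + s1.
```python
import sympy

def maybe_constrain_range(expr, constrain_fn, min=None, max=None):
    if not isinstance(expr, sympy.Symbol):
        # The replacement's range was already updated when it was set,
        # so there is nothing left to constrain here.
        return
    constrain_fn(expr, min=min, max=max)
```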
Differential Revision: [D59257079](https://our.internmc.facebook.com/intern/diff/D59257079)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129907
Approved by: https://github.com/jingsh
Some profiling suggests that the repeated `maybe_evaluate_static` calls are expensive.
Ref: https://github.com/pytorch/pytorch/issues/123964
With test script:
```
import torch
import torch._dynamo.config
torch._dynamo.config.capture_scalar_outputs = True
@torch.compile(fullgraph=True)
def f(a, b):
    xs = b.tolist()
    for x in xs:
        torch._check_is_size(x)
        torch._check(x <= 20)
    return a.split(xs)
N = 20
splits = torch.randint(10, (N,))
sz = splits.sum().item()
f(torch.randn(sz), splits)
```
Before:
```
real 0m18.526s
user 0m16.555s
sys 0m11.031s
```
After:
```
real 0m13.831s
user 0m12.152s
sys 0m10.941s
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129893
Approved by: https://github.com/lezcano
**Summary**
I have added an even more detailed module tracker that now includes the collective counts and operations that happen in each submodule, making it easier for users to debug. The tracing now includes the operations' DTensor arguments' input shapes and sharding. Like the module collective tracing, the user also has the option to log the tracing table to an output.txt file. I have decided not to include the example output for the transformer as it is too many lines. The expected output for MLP_operation_tracing is shown below:
<img width="574" alt="Screenshot 2024-06-25 at 3 33 16 PM" src="https://github.com/pytorch/pytorch/assets/50644008/a09e2504-19d5-4c69-96e8-f84e852d7786">
<img width="467" alt="Screenshot 2024-06-25 at 3 33 45 PM" src="https://github.com/pytorch/pytorch/assets/50644008/55c07d2d-6cb6-410f-82ac-2849bb7bfbbb">
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129017
Approved by: https://github.com/XilunWu
Re-organize `block_mask`-related arguments into a tuple to reduce the number of individual arguments. I was trying to use a named tuple, but aot autograd doesn't work well with named tuples. The only downside of using a tuple rather than a named tuple is that we need to use an index to access its elements. But we only need this in one place, so it should be fine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129831
Approved by: https://github.com/Chillee, https://github.com/drisspg
Summary:
There were two problems with the HistogramObserver:
1. It does not work when someone passes a batch_size 1, tensor_size 1 data-point.
2. The histogram doesn't seem to actually update if the range of the new x falls within the old one.
These issues were both fixed.
On top of this, I greatly simplified the logic for the histogram updating. Now, it doesn't do the downsampling anymore, which saves a ton of memory and code. The accuracy can still be controlled with the upsampling ratio. This ratio was also too high for the accuracy we generally need here, so I reduced the default.
Also the code is cleaner now, much easier to follow what's happening.
test_histogram_observer_same_inputs was likely wrong - If I pass 0s and 1s to my histogramobserver, I want them to actually count! The current test now thinks it's good to discard and ignore these values.
Test Plan: You can run the included tests.
Differential Revision: D58931336
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129387
Approved by: https://github.com/jerryzh168
In the "layout()" method of "TensorImpl" defined in the file core/TensorImpl.h, the following code and documentation can be found:
```
Layout layout() const {
  ...
  if .. {
    ...
  } else if (is_sparse_compressed()) {
    // Typically, the tensor dispatch keys define the tensor layout
    // uniquely. This allows using non-virtual layout method for
    // better performance. However, when tensor's layout depends,
    // say, on tensor attributes, one must use this execution path
    // where the corresponding tensor impl class overwrites virtual
    // layout_impl() method.
    return layout_impl();
  } else {
    ...
  }
}
```
However, this override was never implemented. This PR puts the override in place, to prepare for sparsity propagation in another PR.
https://github.com/pytorch/pytorch/issues/117188
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129930
Approved by: https://github.com/ezyang
Background: this bug was triggering DEBUG=1 asserts in the backward for `unbind()`, which calls `empty_like()`. I found that the NJT implementation of `empty_like()` was redispatching on `values` while blindly passing along all kwargs. This resulted in `empty_like(values, ..., layout=torch.jagged)`, which is incorrect since `values` is strided, tripping the debug assert here:
433b691f98/aten/src/ATen/EmptyTensor.cpp (L305)
This PR explicitly sets `layout=torch.strided` when redispatching `*_like()` factories on `values`.
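A hedged sketch (hypothetical helper, not the NJT source) of the fix: when redispatching an empty_like-style factory to the strided `values` tensor, force the strided layout instead of forwarding `torch.jagged`.
```python
import torch

def empty_like_on_values(values: torch.Tensor, **kwargs) -> torch.Tensor:
    kwargs["layout"] = torch.strided   # `values` is a plain strided tensor
    return torch.empty_like(values, **kwargs)
```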
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129879
Approved by: https://github.com/soulitzer
Fixes #111884
In the minimised reproducer, we have a loop with the index expression `-q0*q1`
for which in the merge tester we get:
```
expr1 = - 0 * (_merge_tester * 16) = 0
expr2 = - _merge_tester * 0 = 0
```
so it decides we can merge the dimensions and `q0` is set to `0`, meaning `-q0*q1` is always zero!
Here I change the test so we have at least one case where no zeros are
substituted so we can catch this situation. In the normal strided case we get
e.g.
```
expr = 16 * q0 + q1
expr1 = 16 * _merge_tester2 + (16 * _merge_tester1)
expr2 = 16 * (_merge_tester2 + _merge_tester1)
```
which are still equivalent expressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129806
Approved by: https://github.com/lezcano
**Summary**
This PR mainly refactor 2 things:
1. Pass in the weight's data type explicitly to `create_micro_gemm` as `input2.dtype`. When registering `CppMicroGemmConfig`, we will reuse `input.dtype` if `input2.dtype` is not explicitly registered.
2. Add a util function to get the output data type and compute data type from the input data type (a rough sketch follows below).
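A hypothetical sketch of such a utility (the name and dtype mapping here are illustrative assumptions, not the actual template code):
```python
import torch

def get_gemm_output_and_compute_dtype(input_dtype: torch.dtype):
    # accumulate in fp32 for reduced-precision inputs
    if input_dtype in (torch.bfloat16, torch.float16):
        return input_dtype, torch.float32
    return input_dtype, input_dtype
```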
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129221
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825, #129048, #129049, #129103, #129220
**Summary**
We change the schema of QLinear Binary, so it will be easier to enable the corresponding gemm template.
- The extra input of the binary post-op is a tensor that needs to be an input node for autotuning, so we move it in front of `output_scale`, which is a scalar.
- We also move it in front of `bias`, since `bias` is an optional tensor for this fusion, while `other` is required for linear binary fusion.
**Test Plan**
```
python -u -m pytest -s -v test/quantization/core/test_quantized_op.py -k qlinear
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k qlinear
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129049
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825, #129048
In nvidia internal testing, for slower devices such as Orin NX, on large dtypes like complex128, test_linalg_solve_triangular_large is taking multiple hours to complete and timing out CI. This PR adds a slowTest marker so it can be skipped due to speed issues. cc @nWEIdia
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129903
Approved by: https://github.com/lezcano
_Action.__repr__ gets rearranged so it doesn't require an underscore or
a 's' prefix, but still keeps multi-digit stage and microbatch indices
separated by an alpha character indicating the action type.
to/from CSV methods allow dumping a generated schedule to CSV format for
offline visualization or manual editing in a spreadsheet and reloading
to use at runtime.
Co-authored-by: Howard Huang <howardhuang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129264
Approved by: https://github.com/H-Huang
Fixes #ISSUE_NUMBER
Gonna fill in the RFC but just want to run CI to see if anything else breaks.
Test:
```
python test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_raise_not_implemented_state_dict_if_2d
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129519
Approved by: https://github.com/awgu
As this is the minimum CMake version supported by top-level PyTorch.
Hides
```
CMake Deprecation Warning at aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/CMakeLists.txt:7 (cmake_minimum_required):
Compatibility with CMake < 3.5 will be removed from a future version of
CMake.
Update the VERSION argument <min> value or use a ...<max> suffix to tell
CMake that the project does not need compatibility with older versions.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129906
Approved by: https://github.com/kit1980
In this PR:
- Ensure that if a tensor not requiring grad is saved for backward, unpacking it does not trigger a detach (unless the user installs a saved tensor pack hook that returns a tensor requiring grad).
- Update non-reentrant checkpoint to also no longer detach for this case.
Alternatives:
- For a custom autograd Function, you could directly save on ctx to work around this, but that would not work once we switch to using custom ops.
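To make the first bullet concrete, here is a minimal sketch of the affected pattern (illustrative only; the PR's own tests are authoritative):
```python
import torch

class Scale(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        # `scale` does not require grad; with this change, unpacking it in
        # backward no longer goes through a detach.
        ctx.save_for_backward(scale)
        return x * scale

    @staticmethod
    def backward(ctx, grad_out):
        (scale,) = ctx.saved_tensors
        return grad_out * scale, None

x = torch.randn(3, requires_grad=True)
scale = torch.tensor(2.0)  # requires_grad=False
Scale.apply(x, scale).sum().backward()
```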
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127959
Approved by: https://github.com/YuqingJ
ghstack dependencies: #125795, #128545, #129262
This PR enables using AOTriton as a shared library dependency instead of a static one.
Resolves the issue of linker errors when trying to build PyTorch for a lot of (>7 or so) gfx archs due to huge size of aotriton static library.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129094
Approved by: https://github.com/malfet
Fixes #127666.
Other std math functions are replaced with those in the global namespace during hipify. HIP does not claim to support every function in the C++ standard library. std::clamp is not yet supported and we have been relying on the std implementation. For Fedora 40 + gcc 14, a host-side assert is used which is not supported. Work-around this by replacing std::clamp with min and max for USE_ROCM builds.
Patch comes from @lamikr. Modified to use #ifndef USE_ROCM.
https://github.com/lamikr/rocm_sdk_builder/pull/37
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127812
Approved by: https://github.com/hongxiayang, https://github.com/malfet
Hard to write tests for this. This PR makes many tests in the stack pass, such as
`PYTORCH_TEST_WITH_DYNAMO=1 pytest test/test_ao_sparsity.py::TestComposability::test_convert_without_squash_mask`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129858
Approved by: https://github.com/mlazos
ghstack dependencies: #129830
Extra period at the end throws off pip:
```
root@f04177cab5af:/data/pytorch# pip install -r .ci/docker/requirements-ci.txt
ERROR: Invalid requirement: 'lxml==5.0.0.': Expected end or semicolon (after version specifier)
lxml==5.0.0.
~~~~~~~^ (from line 309 of .ci/docker/requirements-ci.txt)
```
Not sure why CI docker builds do not have an issue with this period.
Typo comes from f73b1b9388
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129695
Approved by: https://github.com/huydhn
Summary: Fix an indexing issue in torch.fx.interpreter by changing the slice from `[:i]` to `[:i+1]`. If there are `n` elements, the last index `i` of the `for` loop is `n-1`, so `[:i]` only accesses elements from index `0` to index `n-2` and misses the last element; `[:i+1]` accesses all elements correctly.
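A self-contained illustration of the off-by-one (plain Python, not the interpreter code itself):
```python
env = ["n0", "n1", "n2"]
last_i = len(env) - 1                          # last index reached by the loop
assert env[:last_i] == ["n0", "n1"]            # misses the final element
assert env[:last_i + 1] == ["n0", "n1", "n2"]  # includes it
```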
Test Plan: Test with Node API
Differential Revision: D59028395
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129527
Approved by: https://github.com/dulinriley
**Summary:**
Add a RISC-V implementation in DepthwiseConvKernel.
**Compile:**
export USE_CUDA=0
export USE_DISTRIBUTED=0
export USE_MKLDNN=0
export MAX_JOBS=4
export CMAKE_CXX_COMPILER=clang++
export CMAKE_C_COMPILER=clang
export CMAKE_C_FLAGS=-march=rv64gcv
export CMAKE_CXX_FLAGS=-march=rv64gcv
python3 setup.py develop --cmake
**Test Plan:**
**Correctness** - Check the results of test_convolution.py before and after the change
python3 test/run_test.py --include nn/test_convolution --keep-going
**Before:**
===== 9 passed, 13 skipped, 564 deselected in 46.55s =====
The following tests failed consistently:
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_backward_twice
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_inconsistent_types
test/nn/test_convolution.py::TestConvolutionNN::test_conv_modules_raise_error_on_incorrect_input_size
test/nn/test_convolution.py::TestConvolutionNN::test_conv_shapecheck
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv1d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv2d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv3d
test/nn/test_convolution.py::TestConvolutionNN::test_mismatch_shape_conv2d
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_complex64
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_float32
**After:**
===== 9 passed, 13 skipped, 564 deselected in 48.13s =====
The following tests failed consistently:
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_backward_twice
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_inconsistent_types
test/nn/test_convolution.py::TestConvolutionNN::test_conv_modules_raise_error_on_incorrect_input_size
test/nn/test_convolution.py::TestConvolutionNN::test_conv_shapecheck
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv1d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv2d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv3d
test/nn/test_convolution.py::TestConvolutionNN::test_mismatch_shape_conv2d
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_complex64
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_float32
**Performance** - Compare mobilenet_v2 results before and after the change
python3 run.py mobilenet_v2 -d cpu -t eval
**Before:**
Running eval method from mobilenet_v2 on cpu in eager mode with input batch size 16 and precision fp32.
CPU Wall Time per batch: 19590.647 milliseconds
CPU Wall Time: 19590.647 milliseconds
Time to first batch: 5271.3518 ms
CPU Peak Memory: 0.3809 GB
**After:**
Running eval method from mobilenet_v2 on cpu in eager mode with input batch size 16 and precision fp32.
CPU Wall Time per batch: 13523.530 milliseconds
CPU Wall Time: 13523.530 milliseconds
Time to first batch: 2696.0304 ms
CPU Peak Memory: 0.3408 GB
**Versions:**
Clang version: 17.0.2
Platform: CanMV-K230
Architecture: riscv64
OS: Ubuntu 23.10
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127867
Approved by: https://github.com/malfet
To make it easier to debug regressions like the one that happened last Wed, when a new version of torchao was released and resulted in TorchBench downgrading the pytorch version to 2.3.1, print all dependencies after TorchBench is installed.
Test plan: Look at the log output; for example, https://github.com/pytorch/pytorch/actions/runs/9720408234/job/26832794157?pr=129809#step:20:1158 contains
```
+ echo 'Print all dependencies after TorchBench is installed'
Print all dependencies after TorchBench is installed
+ python -mpip freeze
absl-py==2.1.0
accelerate==0.31.0
aiohttp==3.9.5
aiosignal==1.3.1
astunparse==1.6.3
async-timeout==4.0.3
attrs==23.2.0
audioread==3.0.1
beautifulsoup4==4.12.3
boto3==1.19.12
botocore==1.22.12
bs4==0.0.2
cachetools==5.3.3
certifi==2024.6.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129809
Approved by: https://github.com/kit1980, https://github.com/atalman
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.
What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...
Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
**Summary**
When checking the vectorization status across 3 test suites, we found some operators disabled vectorization with the message `Disabled vectorization: op: bitwise_and`. In this PR, we add vectorization support for 6 bitwise functions.
We also remove `bitwise_xor` from the `ops_to_bool` list, which sets the output data type to bool in data type propagation. That seems wrong: according to https://pytorch.org/docs/stable/generated/torch.bitwise_xor.html, it should return the same integral data type as the input, and the testcase `test_bitwise3` failed due to this issue.
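A quick eager-mode sanity check of the dtype claim:
```python
import torch

a = torch.tensor([1, 2, 3], dtype=torch.int32)
b = torch.tensor([3, 2, 1], dtype=torch.int32)
# Integral inputs produce the same integral dtype, not bool.
assert torch.bitwise_xor(a, b).dtype == torch.int32
```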
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_bitwise
python -u -m pytest -s -v test/inductor/test_torchinductor.py -k test_bitwise3
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129733
Approved by: https://github.com/jgong5, https://github.com/Skylion007
This PR follows https://github.com/pytorch/pytorch/pull/129374#pullrequestreview-2136555775 cc @malfet:
> Lots of formatting changes unrelated to PR goal, please keep them as part of separate PR (and please add lint rule if you want to enforce those, or at least cite one)
`usort` allows empty lines within import segments. For example, `usort` does not change any of the following snippets, which differ only in where empty lines appear inside the import segment:
```python
import torch.aaa
import torch.bbb
import torch.ccc

x = ...  # some code
```
```python
import torch.aaa

import torch.bbb
import torch.ccc

x = ...  # some code
```
```python
import torch.aaa
import torch.bbb

import torch.ccc

x = ...  # some code
```
This PR first sorts imports via `isort`, then re-sorts the file using `ufmt` (`usort` + `black`). This enforces the following import style:
1. no empty lines within segments.
2. single empty line between segments.
3. two spaces after import statements.
All the code snippets above will be formatted to:
```python
import torch.aaa
import torch.bbb
import torch.ccc
x = ... # some code
```
which produces a consistent code style.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129751
Approved by: https://github.com/malfet
Fixes based on discussion in https://github.com/pytorch/pytorch/issues/128665
Our previous assumption was that for looped schedules `stage_ids = range(rank, total_stages, num_local_stages)`. This is not true for all schedules. This change relaxes that assumption and allows arbitrary ordering of stages. For example, in the added test we use rank 0: [stage0, stage3], rank 1: [stage1, stage2]. The test also adds a schedule registry (for testing) which runs 1 microbatch through this schedule:
```
F0_0 None None F0_3 B0_3 None None B0_0
None F0_1 F0_2 None None B0_2 B0_1 None
```
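As a small sketch (not the PyTorch scheduling code) of what the relaxed assumption allows, using the 2-rank layout from the test above:
```python
num_ranks, num_local_stages = 2, 2
total_stages = num_ranks * num_local_stages

# Old assumption: stage ids follow a fixed pattern derived from the rank.
old = {rank: list(range(rank, total_stages, num_local_stages)) for rank in range(num_ranks)}
assert old == {0: [0, 2], 1: [1, 3]}

# Now: any per-rank placement is valid, e.g. the "V"-shaped one exercised by the new test.
new = {0: [0, 3], 1: [1, 2]}
```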
Co-authored-by: Will Constable <whc@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128976
Approved by: https://github.com/wconstab
ghstack dependencies: #128983
- Lift tensor constant attributes to buffers: TorchScript does not automatically lift tensor constant attributes to buffers, so the previous converter could not access them. This PR fixes the issue.
- Add SetAttr support for tensor attributes via copy_.
- Add SetAttr support for non-tensor attributes. In particular, we maintain the current value of non-tensor attributes in `name_to_non_tensor_attribute_node`, similar to an interpreter pass over non-tensor attributes. This lets us support the following use case:
```python
def forward(self, x):
    c1 = self.count
    self.count += 1
    c2 = self.count
    return x + c1 + c2
```
- Fixed a bug in GetAttr to support the following use case:
```python
def forward(self, inp):
    x = self.buffer
    self.buffer += 1
    y = self.buffer
    return x + y + inp
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129440
Approved by: https://github.com/angelayi
FSDP implements the following logic, but it's missing from DDP.
This PR adds an equivalent function to DDP.
```python
def __getattr__(self, name: str) -> Any:
    """Forward missing attributes to the wrapped module."""
    try:
        return super().__getattr__(name)  # defer to nn.Module's logic
    except AttributeError:
        return getattr(self._fsdp_wrapped_module, name)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128620
Approved by: https://github.com/awgu
Summary:
In a C++ program, if we have child threads doing GPU work, it would be nice to get traces of those threads as well. The problem is that pushProfilingCallbacks() is not called on child threads, so no observer traces are collected on these threads and they are entirely missing from the final output.
This diff provides a new API that a child thread may elect to call to register itself onto the profiler that was started in the main thread (or whichever Python thread manages the profiler).
Test Plan:
```
buck2 test @mode/opt //caffe2/test:profiler_test_cpp_thread
```
Reviewed By: aaronenyeshi
Differential Revision: D56669942
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128743
Approved by: https://github.com/aaronenyeshi
This PR runs the reduce-scatter copy-in in the default stream, allowing the reduce-scatter input (large allocation proportional to unsharded gradients) to be allocated in the default stream to avoid fragmenting that memory across stream memory pools.
- In general, minimizing memory usage spikes in non-default-stream memory pools helps because otherwise, that memory cannot be reused by the default stream outside of that spike. This reduce-scatter input allocation represents one such spike. The reduce-scatter outputs are still allocated in the separate `reduce_scatter` stream since they are small and have a non-spiky allocation/free pattern (we iteratively allocate them through backward and free them altogether after optimizer).
- This PR should not have any impact on overlap (I sanity checked Llama3-8B traces from torchtitan; plus we have the `test_fully_shard_overlap.py` unit tests).
**Experiment**
**(Before)** Llama3-8B, 1D FSDP, 8 H100s, bf16/fp32 mixed precision, no AC, local batch size 1:
```
[rank0]:2024-06-27 16:38:56,620 - root - INFO - step: 1 loss: 12.2764 memory: 71.99GiB(75.75%) wps: 1,436 mfu: 8.41%
[rank0]:2024-06-27 16:38:56,620 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:2024-06-27 16:38:57,943 - root - INFO - step: 2 loss: 12.1001 memory: 79.82GiB(83.98%) wps: 6,195 mfu: 36.28%
[rank0]:2024-06-27 16:38:59,266 - root - INFO - step: 3 loss: 11.7697 memory: 79.82GiB(83.98%) wps: 6,193 mfu: 36.27%
[rank0]:2024-06-27 16:39:00,587 - root - INFO - step: 4 loss: 11.2807 memory: 79.82GiB(83.98%) wps: 6,203 mfu: 36.32%
[rank0]:2024-06-27 16:39:01,910 - root - INFO - step: 5 loss: 10.9494 memory: 79.82GiB(83.98%) wps: 6,198 mfu: 36.30%
```
**(After)** Llama3-8B, 1D FSDP, 8 H100s, bf16/fp32 mixed precision, no AC, local batch size 1:
```
[rank0]:2024-06-27 16:41:12,106 - root - INFO - step: 1 loss: 12.2560 memory: 69.46GiB(73.08%) wps: 1,158 mfu: 6.78%
[rank0]:2024-06-27 16:41:12,106 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:2024-06-27 16:41:13,502 - root - INFO - step: 2 loss: 12.0949 memory: 77.29GiB(81.32%) wps: 5,870 mfu: 34.37%
[rank0]:2024-06-27 16:41:14,839 - root - INFO - step: 3 loss: 11.7770 memory: 77.29GiB(81.32%) wps: 6,130 mfu: 35.90%
[rank0]:2024-06-27 16:41:16,154 - root - INFO - step: 4 loss: 11.3188 memory: 77.29GiB(81.32%) wps: 6,230 mfu: 36.48%
[rank0]:2024-06-27 16:41:17,474 - root - INFO - step: 5 loss: 10.9443 memory: 77.29GiB(81.32%) wps: 6,211 mfu: 36.37%
```
**2.53 GiB reduction in peak reserved memory.**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129721
Approved by: https://github.com/weifengpy, https://github.com/yifuwang
Update ruff to 0.5.0 so we can enable some of the new checks I've been wanting to add to the codebase. First, just update the code to comply with some rule changes and a couple of minor API changes / deprecations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129744
Approved by: https://github.com/ezyang
I think there is a typo in the first example of the `torch.func.stack_module_state` documentation. The first parameter in the function call in the `wrapper` return is missing an 's'.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129126
Approved by: https://github.com/zou3519
Fixes the silent correctness issue in #129207 by preventing the user from calling the convolution op on MPS device with an unsupported value.
The fix for the missing support is coming in later as that requires work on the kernel side so it'll take some more time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129484
Approved by: https://github.com/kulinseth
As @vmoens pointed out, the current error message does not make the "either/or" between setting `weights_only=False` and using `add_safe_globals` clear enough, and should print the code for the user to call `add_safe_globals`
New formatting looks like such
In the case that `add_safe_globals` can be used
```python
>>> import torch
>>> from torch.testing._internal.two_tensor import TwoTensor
>>> torch.save(TwoTensor(torch.randn(2), torch.randn(2)), "two_tensor.pt")
>>> torch.load("two_tensor.pt", weights_only=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/users/mg1998/pytorch/torch/serialization.py", line 1225, in load
raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options
(1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
WeightsUnpickler error: Unsupported global: GLOBAL torch.testing._internal.two_tensor.TwoTensor was not an allowed global by default. Please use `torch.serialization.add_safe_globals([TwoTensor])` to allowlist this global if you trust this class/function.
Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
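For completeness, the allowlisting path the message points to looks like this (continuing the `TwoTensor` example above):
```python
>>> from torch.serialization import add_safe_globals
>>> from torch.testing._internal.two_tensor import TwoTensor
>>> add_safe_globals([TwoTensor])
>>> torch.load("two_tensor.pt", weights_only=True)  # now loads successfully
```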
For other issues (unsupported bytecode)
```python
>>> import torch
>>> t = torch.randn(2, 3)
>>> torch.save(t, "protocol_5.pt", pickle_protocol=5)
>>> torch.load("protocol_5.pt", weights_only=True)
/data/users/mg1998/pytorch/torch/_weights_only_unpickler.py:359: UserWarning: Detected pickle protocol 5 in the checkpoint, which was not the default pickle protocol used by `torch.load` (2). The weights_only Unpickler might not support all instructions implemented by this protocol, please file an issue for adding support if you encounter this.
warnings.warn(
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/users/mg1998/pytorch/torch/serialization.py", line 1225, in load
raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Unsupported operand 149
Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
Old formatting would have been like:
```python
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/users/mg1998/pytorch/torch/serialization.py", line 1203, in load
raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you get the file from a trusted source. Alternatively, to load with `weights_only` please check the recommended steps in the following error message. WeightsUnpickler error: Unsupported global: GLOBAL torch.testing._internal.two_tensor.TwoTensor was not an allowed global by default. Please use `torch.serialization.add_safe_globals` to allowlist this global if you trust this class/function.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129705
Approved by: https://github.com/albanD, https://github.com/vmoens
ghstack dependencies: #129239, #129396, #129509
Summary:
There is a small cosmetic issue in the C++ wrapper file generated by AOTInductor: the launchKernel() call isn't properly indented.
Added indentation for the launchKernel() code block when there's an "if" condition, i.e. when `grid_uses_symbolic_shapes` is `True`.
Test Plan:
Test cmd ran (in pytorch oss):
`TORCH_LOGS="output_code" TORCH_COMPILE_DEBUG=1 python test/inductor/test_aot_inductor.py -k test_zero_grid_with_backed_symbols_abi_compatible_cuda`
And then manually verified the output code generated in a path like
`/tmp/torchinductor_guorachel/coraisesuchpl3qabrazn7ydydszcit6lwpn7ckd3b4wej4rep5l/cba5g5ajeh5sym3tp5iqn7kkokimj7qqd4krs2rruhupbfqgppge.cpp`
Similarly, also verified for test case:`test_zero_grid_with_unbacked_symbols_abi_compatible_cuda`
Differential Revision: D58897157
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129616
Approved by: https://github.com/ColinPeppler
TorchDynamo guard mechanism guards on the key order on the dictionaries if the user iterates over the dictionary. For standard dict, we can write a fast C++ implementation by using PyDict_Next. But with OrderedDict, we have to rely on `keys` Python API to get the key ordering. This makes guard evaluation slow.
With Dynamo inlining into inbuilt nn modules, I am seeing many guards over the OrderedDict on `_modules`, `_parameters`. From reading the code, I don't see any reason to not use standard dicts. I think OrderedDict was preferred over dict because of the ordering, but dicts are now ordered. With this PR, I am observing ~20% reduction in guard overhead of a HF model.
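As a quick illustration of the "dicts are now ordered" point (plain Python, independent of this PR):
```python
d = {}
d["weight"] = 1
d["bias"] = 2
d["running_mean"] = 3
assert list(d) == ["weight", "bias", "running_mean"]  # insertion order is preserved (Python 3.7+)

from collections import OrderedDict
od = OrderedDict(d)
od.move_to_end("weight")  # the only OrderedDict-specific method nn.Module relies on, and only for hooks
assert list(od) == ["bias", "running_mean", "weight"]
```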
Functionality impact
- The only difference between dict and OrderedDict is the `move_to_end` method for OrderedDict ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)). But the changes here are internal to nn module, and we do not use `move_to_end` for `_parameters`, `_modules` and `_buffers`. We use `move_to_end` for hooks, but this PR keeps the OrderedDict for hooks untouched (we should still follow up on hooks, but in a separate PR).
Perf impact
- I don't anticipate any perf impact. `dict` is completely implemented in C. OrderedDict is a Python wrapper over dict with only a few methods overridden ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)).
Typing impact
- I don't anticipate any. For all the user visible methods of nn.Module, we don't expose the underlying `_modules` etc. We have iterators like `named_parameters` which return an Iterator of Parameter. So, no typing changes required.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129164
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #129163
Fixes #129601
Background: it's possible that a traceable wrapper subclass will have an optional inner tensor constituent (e.g. NJT's cached min / max sequence lengths). To specify this, the subclass's `__tensor_flatten__()` impl should leave out any unspecified optional inner tensors in the returned list of `attrs`.
This PR guards on the list of inner tensor `attrs` returned in `subclass.__tensor_flatten__()[0]`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129618
Approved by: https://github.com/anijain2305
Fixes #122978
## Summary
To fix compilation error for test test_dynamic_scalar_abi_compatible_cpu_with_stack_allocation
- Error 1
```
error: no matching function for call to ‘torch::aot_inductor::ArrayRefTensor<float>::ArrayRefTensor(float [1], const int64_t [0], const int64_t [0], int&, int32_t&)’
613 | ArrayRefTensor<float> buf3(buf3_storage, int_array_6, int_array_6, cached_torch_device_type_cpu, this->device_idx_);
| ^
...
torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:188:35: note: no known conversion for argument 2 from ‘const int64_t [0]’ {aka ‘const long int [0]’} to ‘torch::aot_inductor::MiniArrayRef<const long int>’
188 | MiniArrayRef<const int64_t> sizes,
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~
```
Fix: added constructor for empty array in arrayref_tensor.h
- Error 2
```
error: cannot convert ‘torch::aot_inductor::ArrayRefTensor<float>’ to ‘AtenTensorHandle’ {aka ‘AtenTensorOpaque*’}
625 | AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float32(buf3, &zuf0_raw));
| ^~~~
| |
| torch::aot_inductor::ArrayRefTensor<float>
```
Fix: in cpp_wrapper_cpu.py, added codegen to call convert ArrayRefTensor to AtenTensorHandle first.
## Test Plan
```
python test/inductor/test_aot_inductor.py -k AOTInductorTestABICompatibleCpuWithStackAllocation.test_dynamic_scalar_abi_compatible_cpu_with_stack_allocation
```
Before the fix, detailed in #122978:
```
| AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float32(buf3, &zuf0_raw));
| ^~~~
| |
| torch::aot_inductor::ArrayRefTensor<float>
/home/yingzhaoseattle/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/utils.h:34:8: note: in definition of macro ‘AOTI_TORCH_ERROR_CODE_CHECK’
Ran 1 test in 4.377s
FAILED (errors=1)
```
After the fix
```
/home/yingzhaoseattle/pytorch/torch/backends/cudnn/__init__.py:107: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
warnings.warn(
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('extern_calls', 1)]
.
----------------------------------------------------------------------
Ran 1 test in 9.633s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129173
Approved by: https://github.com/chenyang78
FSDP2 accumulates gradients for sharded parameters outside of the autograd engine's normal accumulation logic. We can respect registered post-accumulate-grad hooks by running them manually.
**Discussion**
Discussing with @soulitzer, changing FSDP2 to make the sharded parameters autograd leaves requires nontrivial changes to FSDP and some changes to the autograd engine (around forward vs. backward streams) where the changes may not preserve eager-mode performance and/or add some complexity.
Under the FSDP2 design, the sharded parameters never participate in autograd, so calling `register_post_accumulate_grad_hook` on them would otherwise be a no-op. In other words, there is virtually no chance for FSDP2 incorrectly re-running the hook when it should not.
Given these, a reasonable near-term solution is for FSDP2 to run the post-accumulate-grad hooks manually.
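For reference, the user-facing pattern this enables is the usual optimizer-in-backward recipe built on `register_post_accumulate_grad_hook`; a minimal single-device sketch (the model and optimizer choices here are illustrative, not from the PR):
```python
import torch

model = torch.nn.Linear(16, 16)
# One foreach=False optimizer per parameter, stepped from the hook.
optims = {p: torch.optim.AdamW([p], lr=1e-3, foreach=False) for p in model.parameters()}

def optim_step(param: torch.nn.Parameter) -> None:
    # Runs right after .grad has been accumulated for this parameter.
    optims[param].step()
    optims[param].zero_grad()

for p in model.parameters():
    p.register_post_accumulate_grad_hook(optim_step)

model(torch.randn(4, 16)).sum().backward()  # optimizer steps happen during backward
```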
**Caveats**
- Running `foreach=False` optimizer _per parameter tensor_ incurs significantly higher CPU overhead compared to `foreach=True` (partially due to `DTensor` being a `__torch_dispatch__` tensor subclass).
- On preliminary benchmarking on Llama3-8B on 8 GPUs, this CPU overhead is mostly tolerable, but on smaller # of GPUs or a less compute-intensive model, this may not be.
- One solution for native Adam/AdamW is to use `fused=True`, which makes both the CPU overhead lower and GPU compute faster. However, this is generally not an option for user-defined optimizers.
- If this CPU overhead blocks adoption of this feature, then we should seriously consider an FSDP-specific API like `register_post_backward_hook(params: List[nn.Parameter]) -> None` that allows the user to see all parameters in the `FSDPParamGroup` together for the hook so that the user can still run a `foreach=True` optimizer step on that `List[nn.Parameter]`.
- The post-accumulate-grad hook runs in the reduce-scatter stream. Our current stream handling logic does not have the default stream wait for the reduce-scatter stream until the end of backward. Unless we add that, we cannot simply run the post-accumulate-grad hook in the default stream.
- This means that optimizer compute will overlap with backward compute, which may slow down end-to-end execution slightly (e.g. due to SM contention or wave quantization effects). For example, on Llama3-8B, we see about a ~3% decrease in MFU when running the optimizer in backward even though the optimizer steps are fully overlapped and there are no CPU boundedness issues.
- This PR's goal is only to run the hook manually. State dict etc. for optimizer-in-backward is out of scope.
**Experiments (torchtitan)**
- Llama3-8B on 2 GPUs, local batch size 1, with full activation checkpointing, and bf16/fp32 mixed precision:
- Without optimizer-in-backward: 82.03 GiB reserved memory; 28.1% MFU
- With optimizer-in-backward (`foreach=False`): 72.84 GiB reserved memory; 28.9% MFU (speedup from more of optimizer step overlapped)
- With optimizer-in-backward (`fused=True`): 70.84 GiB reserved memory; 30.4% MFU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129450
Approved by: https://github.com/weifengpy, https://github.com/yf225
It has been deprecated since CMake 3.0 in favor of `execute_process`; see https://cmake.org/cmake/help/v3.18/command/exec_program.html
This makes the following warning disappear:
```
CMake Warning (dev) at cmake/Modules/FindARM.cmake:5 (EXEC_PROGRAM):
Policy CMP0153 is not set: The exec_program command should not be called.
Run "cmake --help-policy CMP0153" for policy details. Use the cmake_policy
command to set the policy and suppress this warning.
Use execute_process() instead.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129714
Approved by: https://github.com/kit1980
Fixes #129685
After matching a pattern, we currently try to remove all the nodes of that pattern, which doesn't work if any intermediate node has users outside of the pattern; in that case we can't delete those particular nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129689
Approved by: https://github.com/shunting314
I am building PyTorch with the Intel oneAPI 2024.0.0 compiler, and encountered this compile error:
```
[ 85%] Building CXX object caffe2/CMakeFiles/cpu_rng_test.dir/__/aten/src/ATen/test/cpu_rng_test.cpp.o
In file included from /home/src/pytorch/aten/src/ATen/test/cpu_rng_test.cpp:2:
/home/src/pytorch/aten/src/ATen/test/rng_test.h:119:41: error: loop variable 'to' creates a copy from type 'const ::std::optional<int64_t>' (aka 'const optional<long>') [-Werror,-Wrange-loop-construct]
119 | for (const ::std::optional<int64_t> to : tos) {
| ^
/home/src/pytorch/aten/src/ATen/test/rng_test.h:119:10: note: use reference type 'const ::std::optional<int64_t> &' (aka 'const optional<long> &') to prevent copying
119 | for (const ::std::optional<int64_t> to : tos) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| &
1 error generated.
```
This change makes the compiler happy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129589
Approved by: https://github.com/colesbury
Before this PR, custom ops that don't return outputs would get eliminated after calling `.module()`, because the effect token that keeps the operator alive is removed in the remove_effect_token pass. We remove the effect token because we don't want it to be part of the inputs. However, this allows the DCE call in remove_effect_token itself, as well as the DCE call in unlift, to remove the custom op from the graph, causing an error in the exported graph.
This PR calls has_side_effect in with_effect to make sure graph.eliminate_dead_code doesn't remove the calls by accident.
Test Plan:
Add a new test pytest test/export/test_torchbind.py -k test_export_inplace_custom_op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129680
Approved by: https://github.com/angelayi
Fixes [#129370](https://github.com/pytorch/pytorch/issues/129370)
Suggest a correct List type annotation when the input is a Tuple type. To avoid confusion, we only suggest a type if the type is supported.
Example:
Tuple[int, int] -> List[int]
Tuple[Tensor, Tensor, Optional[Tensor]] -> List[Optional[Tensor]]
Tuple[int, ...] -> List[int]
ValueError: infer_schema(func): Parameter y has unsupported type typing.Tuple[torch.Tensor, torch.Tensor, typing.Optional[torch.Tensor]]. Tuple type annotation is not supported. Please try to use a List instead. For example, typing.List[typing.Optional[torch.Tensor]].
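A hedged sketch of the suggested rewrite using the custom-op API that `infer_schema` backs (the op name and body are illustrative, not from the PR):
```python
from typing import List, Optional

import torch

# Rejected: y: Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
# Accepted: the suggested List annotation instead.
@torch.library.custom_op("mylib::masked_add", mutates_args=())
def masked_add(x: torch.Tensor, y: List[Optional[torch.Tensor]]) -> torch.Tensor:
    out = x.clone()
    for t in y:
        if t is not None:
            out = out + t
    return out
```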
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129417
Approved by: https://github.com/zou3519
Changes by apply order:
1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.
`.parent{...}.absolute()` -> `.absolute().parent{...}`
4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)
`.parent.parent.parent.parent` -> `.parents[3]`
5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~
~`.parents[3]` -> `.parents[4 - 1]`~
6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
This PR introduces `_detect_dma_connectivity` - a utility for detecting DMA connectivity among devices.
The "DMA connectivity" in this context is more stringent than the ability to perform memory copy without CPU involvement. We define it as the ability for a device to issue load/store instructions and perform atomic operations on memory that resides on connected devices. The ability translates to the ability to run most aten GPU operations with operands backed by remote memory. `_detect_dma_connectivity` can help PyTorch and its users to determine whether certain DMA-based optimizations are possible.
`_detect_dma_connectivity` takes a `(device_type, connection_type)` pair and returns a matrix describing the connectivity. Connectivity detectors are statically registered on a `(device_type, connection_type)` basis. This PR implements the detector for `(CUDA, "nvlink")`. Later, detectors for pairs such as `(ROCM, "infinity_fabric")` can be introduced.
Example:
```python3
>>> from torch._C._autograd import DeviceType
>>> from torch._C._distributed_c10d import _detect_dma_connectivity
>>> connectivity = _detect_dma_connectivity(DeviceType.CUDA, "nvlink")
>>> for row in connectivity.matrix:
... print(row)
...
[0, 18, 18, 18, 18, 18, 18, 18]
[18, 0, 18, 18, 18, 18, 18, 18]
[18, 18, 0, 18, 18, 18, 18, 18]
[18, 18, 18, 0, 18, 18, 18, 18]
[18, 18, 18, 18, 0, 18, 18, 18]
[18, 18, 18, 18, 18, 0, 18, 18]
[18, 18, 18, 18, 18, 18, 0, 18]
[18, 18, 18, 18, 18, 18, 18, 0]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129510
Approved by: https://github.com/weifengpy
Upload test stats regardless of the workflow conclusion (always) so that we can get status for cancelled workflows (especially ones that were cancelled manually).
There aren't that many workflow conclusions, so we might as well always run it and see what happens.
Undoes [this old PR](https://togithub.com/pytorch/pytorch/pull/79180)
Notable pitfalls from the above:
Might cause noise if things can't be downloaded, but since this workflow doesn't show up on PRs, I think it's ok to slowly deal with what comes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129694
Approved by: https://github.com/huydhn
When the input predicate is a python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior was that we baked True/False into the cond operator, which can be confusing. In this PR, we change it to specialize into one of the branches when the predicate is a constant.
We additionally let the cond operator keep its default name instead of overriding it. This allows better testing on the de-serialized graph.
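A minimal sketch of the two predicate flavors discussed above (the functions and shapes are illustrative):
```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

def f(x):
    # Python-bool predicate (static shape comparison): we now specialize into
    # one branch and warn that the dynamism is not preserved.
    return torch.cond(x.shape[0] > 4, true_fn, false_fn, (x,))

def g(x):
    # Data-dependent tensor predicate: both branches stay in the traced graph.
    return torch.cond(x.sum() > 0, true_fn, false_fn, (x,))
```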
Test Plan:
The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check that cond is specialized as one of the branches.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128709
Approved by: https://github.com/zou3519
In this PR, we implement the first version of the training_ir.run_decomp functionality. Since we don't return the modified buffers as extra outputs in the training IR, our previous strategy of reusing the graph signature won't work. In fact, this run_decomp is more similar to retracing, so I reuse some of the export steps here. After this PR:
export_for_training().run_decomp({}, _preserve_ops=[all 183 ops]) == export_for_predispatch() - autograd_manipulating_ops.
Differential Revision: [D59069090](https://our.internmc.facebook.com/intern/diff/D59069090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129249
Approved by: https://github.com/zhxchen17
ghstack dependencies: #128077, #129092
Summary: Somehow the delegate returns a real tensor result even though we pass in fake tensors. So here we need to convert the result to fake.
Test Plan: `buck2 run @//mode/dev-nosan //on_device_ai/helios/multi_zion:multi_zion_test -- -r test_single_delegate_dsp_only`
Differential Revision: D58617091
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128752
Approved by: https://github.com/ydwu4
Currently the runner determinator is buggy and doesn't let anyone's workflows run against the LF runners (it prefixes a "@" to the user names in the issue instead of either stripping it or prefixing it to the incoming names)
This PR fixes the bug so that people opted in to using LF runners can actually use them. It also puts the python code back into the repo. Even though the code isn't directly invoked, having it there makes testing and linting easier/possible
Also includes lint fixes
Note: if you just review the .yml file you'll see all the relevant diffs
### Testing:
#### Before
```
python .github/scripts/runner_determinator.py --github-token $GH_KEY --github-issue 5132 --github-actor ZainRizvi --github-issue-owner ZainRizvi --github-branch foo
{"label_type": "", "message": "LF Workflows are disabled for ZainRizvi, ZainRizvi. Using meta runners."}
```
#### After
```
python .github/scripts/runner_determinator.py --github-token $GH_KEY --github-issue 5132 --github-actor ZainRizvi --github-issue-owner ZainRizvi --github-branch foo
{"label_type": "lf.", "message": "LF Workflows are enabled for ZainRizvi, ZainRizvi. Using LF runners."}
```
Aside: updated test case after rebase:
```
python .github/scripts/runner_determinator.py --github-token $GH_KEY --github-issue 5132 --github-actor ZainRizvi --github-issue-owner ZainRizvi2 --github-branch foo --github-repo python/pythonss --github-ref-type branch
{"label_type": "lf.", "message": "LF Workflows are enabled for ZainRizvi. Using LF runners."}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129612
Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt
Summary: LLVM-15 has a warning `-Wno-return` which can be used to identify functions that do not return. Qualifying these functions with `[[noreturn]]` is a perf optimization.
Test Plan: Sandcastle
Reviewed By: dmm-fb
Differential Revision: D59003594
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129575
Approved by: https://github.com/Skylion007
Fixes: #128478
In the backward() implementation, the checkpointing code was querying the device type from the rng_state tensors saved on forward(). These tensors are CPU-only tensors and don't carry device information with them, so the CUDA device was assumed as a default. This is not correct if the user runs on some other device, for example XPU.
This patch saves full device information on forward() and uses it on backward() to get the device type. Previously forward saved only the device index.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128671
Approved by: https://github.com/guangyey, https://github.com/soulitzer
Summary: [Here](ea588d7fd3/torch/_inductor/kernel/conv.py (L252)) in the `conv` lowering `dilation` is not `size_hint`-ed. This breaks if `dilation` is a symbolic expression (which we see in some internal models). The PR fixes it by adding a `size_hints`.
Test Plan:
```
$ python test/inductor/test_torchinductor.py -k test_convolution5
...
----------------------------------------------------------------------
Ran 2 tests in 7.329s
OK
```
Differential Revision: D59097019
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129631
Approved by: https://github.com/chenyang78
Summary: Somehow, using underscore alias of some builtin types breaks pyre
Test Plan:
All failed tests from D58983461 are passing:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/training_toolkit/utils/tests:gpu_memory_utils_test-type-checking
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/lib:device_util-type-checking
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/lib:thompson_samplers_gpu-type-checking
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/modules/retrieval/diversity/tests:combined_sampling_diversifier_test-type-checking
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/modules/retrieval/diversity/tests:submodular_opt_test-type-checking
```
Differential Revision: D59029768
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129525
Approved by: https://github.com/XuehaiPan, https://github.com/clee2000, https://github.com/malfet
Summary: Add Shivam to the list of code owners for the profiler code paths, so that Shivam gets added to reviewers for PRs too.
Test Plan: CI
Differential Revision: D59072152
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129591
Approved by: https://github.com/sraikund16
Summary:
To fix the following failure cases:
For example, when `M, K, N = 245760, 656, 6560`, fp8 with compile fails due to `RuntimeError: mat2 must be col_major`.
---------
From the inductor generated code (https://fburl.com/everpaste/epcagkrd)
```
V0625 01:38:55.551000 140329914449920 torch/_inductor/scheduler.py:1623] [0/0] scheduling ComputedBuffer(name='buf12', layout=FixedLayout('cuda', torch.float8_e4m3fn, size=[656, 6560], stride=[6656, 1]),
... ...
V0625 01:38:56.194000 140329914449920 torch/_inductor/graph.py:1680] [0/0] [__output_code] buf12 = empty_strided_cuda((656, 6560), (6656, 1), torch.float8_e4m3fn)
... ...
V0625 01:38:56.194000 140329914449920 torch/_inductor/graph.py:1680] [0/0] [__output_code] return (buf10, buf2, buf5, buf6, reinterpret_tensor(buf11, (245760, 656), (1, 245760), 0), reinterpret_tensor(buf12, (6560, 656), (1, 6656), 0), )
... ...
V0625 01:39:12.098000 140312968167424 torch/_inductor/graph.py:1680] [1/0_1] [__output_code] assert_size_stride(permute_10, (6560, 656), (1, 6656))
... ...
V0625 01:39:12.098000 140312968167424 torch/_inductor/graph.py:1680] [1/0_1] [__output_code] buf8 = aten._scaled_mm.default(buf6, permute_10, buf7, reciprocal_3, None, None, torch.bfloat16)
```
Inductor gives the mat2 (`permute_10`) a different stride (`6656`) instead of using its shape[0] (`(6560, 656)`).
Therefore, the `stride[1] == shape[0]` condition fails.
To fix the issue, simply modify the `is_col_major` check to exclude this condition as it doesn't hold for all valid cases.
Test Plan:
Run the failed case again. It works with the fix.
-----
Sandcastle / GitHub CI will make sure the existing tests could still pass.
Reviewed By: vkuzo
Differential Revision: D58994704
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129521
Approved by: https://github.com/drisspg
This PR does 3 things:
1. Adds a copy-free strided->jagged layout conversion for NT
2. Adds a copy-free jagged->strided layout conversion for NT
3. Modifies and expands the .to() API to support the layout argument for the specific case of NT layout conversion.
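A short usage sketch of the expanded `.to()` API as described in item 3 (hedged: based on this description, not verified against the final API surface):
```python
import torch

nt_strided = torch.nested.nested_tensor(
    [torch.randn(2, 8), torch.randn(3, 8)], layout=torch.strided
)
nt_jagged = nt_strided.to(layout=torch.jagged)  # strided -> jagged (item 1, copy-free)
nt_back = nt_jagged.to(layout=torch.strided)    # jagged -> strided (item 2, copy-free)
```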
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115749
Approved by: https://github.com/jbschlosser
Summary:
Use JK to control the release instead of using env variable to toggle the feature.
Note: sharing the store reduces shutdown races as the TCPStore lifecycle is managed outside of trainer rank execution time.
Test Plan: CI
Differential Revision: D59071544
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129603
Approved by: https://github.com/d4l3k
As titled. Previously, __obj_flatten__ could run in a fake tensor mode, e.g. in process_input of aot_autograd, which is surrounded by a fake tensor mode. This causes the tensor ops inside __obj_flatten__ to run under fake tensor mode. However, tensors inside a script object are real tensors, which causes the fake tensor mode to error out saying that we need to first fakify all the tensors (because allow_non_fake_inputs is set to True).
In this PR, we disable all the dispatch modes when running to_fake_obj.
Note that the output of `__obj_flatten__` will be fakified and filled inside the corresponding FakeScriptObject, so during tracing we'll be using a FakeScriptObject that has fake tensor contents.
Test Plan:
Add a new test: pytest test/export/test_torchbind.py -k test_compile_tensor_op_in_tensor_flatten
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129605
Approved by: https://github.com/angelayi
Install torchao explicitly: torchao-0.3.0, which was recently released to PyPI, introduced a hard dependency on torch-2.3.1, which results in the following cryptic error: `RuntimeError: operator torchvision::nms does not exist`
TODOs:
- Figure out what installs torchao from pypi rather than builds from source
- Add proper CI pin for torchao
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129620
Approved by: https://github.com/kit1980, https://github.com/huydhn
Recently we decided to split the export IR into two different IRs (training vs inference). In the inference IR, one major change we decided to introduce was keeping the composite ops that the user specified in the IR. This PR does that by overriding the CompositeImplicitAutograd decomp in the export inference path.
Differential Revision: [D58701607](https://our.internmc.facebook.com/intern/diff/D58701607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128077
Approved by: https://github.com/bdhirsh
Summary:
Expose nlohmann json library so that it can be used from inside Pytorch. The library already exists in the `third_party` directory. This PR is making `nlohmann/json.hpp` header available to be used from `torch.distributed`.
The next PR makes actual use of this header.
imported-using-ghimport
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D59035246
Pulled By: c-p-i-o
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129570
Approved by: https://github.com/d4l3k, https://github.com/malfet
Summary: We need to add the Rank information to the NCCL debug data so that kineto can infer all the necessary process group info such that on-demand can create distributedInfo metadata. Kineto portion will be added in a follow up diff
Test Plan: Tested in D58736045, this diff just splits the kineto and profiler instances
Differential Revision: D59028819
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129528
Approved by: https://github.com/aaronenyeshi
In case of ciflow, runs are triggered by a tag which is created by @pytorchbot, which breaks the logic of the runner determinator.
In case of tag triggers, extract the pr number from the tag name, fetch the pr and extract the user login from it.
Both the inline and standalone python scripts have been updated for consistency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129500
Approved by: https://github.com/malfet, https://github.com/zxiiro
* created fb internal implementation in `caffe2/torch/csrc/monitor/fb/instrumentation.cpp`
* uses `facebook::data_preproc::WaitCounterUs` under the hood by having `WaitCounterImpl` trivially subclass it.
* this makes `WaitCounterHandle` a glorified pointer to `facebook::data_preproc::WaitCounterUs` which is statically defined in the `STATIC_WAIT_COUNTER` macro making these pointers Meyer's singletons.
* `facebook::data_preproc::WaitCounterUs` uses 3 singletons:
1. `std::unique_ptr<DynamicCounter::State>` map — leaky singleton
2. `std::weak_ptr<WaitCounterUs::State>` map — leaky singleton
3. publisherSingleton — normal singleton since it manages resources (threads)
* `facebook::data_preproc::WaitCounterUs` actually owns shared pointers to the state and its destructor will remove it from the `std::weak_ptr<WaitCounterUs::State>` map when the reference count for the state hits 0.
* linked `caffe2/torch/csrc/monitor/fb/instrumentation.cpp` and added `//data_preproc/common:counters` (dpp dependency) to `caffe2/fb/fbcode/target_definitions.bzl`
* wrapped OSS null implementation in `#ifndef FBCODE_CAFFE2` so that internally we use the fb internal implementation.
as a follow-up I might move the counter implementation out of the data_preproc/counters library to a more common ai infra library?
Differential Revision: [D58458751](https://our.internmc.facebook.com/intern/diff/D58458751/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128605
Approved by: https://github.com/c-p-i-o
ghstack dependencies: #128466
Small improvements on runner determinator script:
* Don't do splitting of the issue comment, unless necessary;
* Match username against a set over a list;
* Match both triggering_actor and issue owner over only actor (to avoid edge cases, where we get `pytorch-bot[bot]`)
* Add stripping, to remove potential breaking and not visible whitespaces;
* Don't use linux.4xlarge as a runner: it should not depend on meta runners, for reliability;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129462
Approved by: https://github.com/zxiiro, https://github.com/ZainRizvi
In `oneShotAllReduce`, ranks read data from peers in a round-robin fashion to load-balance NVLinks. However, the subsequent reduction is also performed in this order, which is different across ranks. This can result in slight numerical differences across ranks, which can lead to a hang in data-dependent applications like speculative decoding.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129501
Approved by: https://github.com/Chillee
Fixes #129079
Currently, tensor objects are loaded correctly in-place, but non-tensor objects such as the learning rate are not loaded correctly after f518cf811d, which is a regression introduced in 2.3.
This PR replaces tree_map_only and the manual replacement of the state dict items with _tree_map_only, fixing the regression in non-tensor loading.
Test:
```
# test to make sure lr is loading correctly
python3 test/distributed/checkpoint/e2e/test_e2e_save_and_load.py -k test_init_state_dict
# test to make sure load on meta device model still works
python3 test/distributed/checkpoint/test_tp_checkpoint.py -k test_tp_checkpoint_load_on_meta_device
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129398
Approved by: https://github.com/fegin
Summary:
_decompose_exported_program() ran into an issue with trace_joint, where trace_joint() produces values with mismatching FakeModes. Adding fake mode context to aot_export_module() so this doesn't happen.
#thanks to tugsbayasgalan for the fix!
Test Plan: test_experimental
Differential Revision: D58977694
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129421
Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17
**Summary**
The logic for CommDebugMode module collective tracing was incorrect, as it only worked for leaf module nodes in the model's module tree. If we had a sub-module that had a collective call along with a nested module inside it, the sub-module was not removed from the module_tracker parent set, leading to double-counting of collectives. This problem is addressed by checking that the current sub-module is not already in the parent set. The output of the below test cases should remain the same.
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128887
Approved by: https://github.com/XilunWu
ghstack dependencies: #128729
**Summary**
Currently, there is only an example file for comm_mode and its features. I have created test cases that mirror the examples while the more complicated test cases also ensure that comm_mode resets all variables when used multiple times in the same function. This test case suite will also help developers ensure that new code they add to comm_mode does not affect correctness of old features.
#128536
**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode_features.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128729
Approved by: https://github.com/XilunWu
Upload sccache stats to s3 instead of rockset
I don't think we use these anywhere, so it's ok to cut off the ingest into rockset right now.
We should consider deleting this entirely if we don't plan on using it
I will work on copying existing data over from rockset to s3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129490
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
This PR removes `ProcessGroupCudaP2P` and changes async-TP to use `SymmetricMemory`. The async-TP implementation is still workspace-based, but it now doesn't require a buffer size to be specified upfront.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128762
Approved by: https://github.com/wanchaol
Inductor currently materializes a large sparse matrix in the backward pass for CrossEntropyLoss and loads it to compute gradients of the Softmax input. If we could fuse the sparse-matrix computation into the consumer side, we would get both perf and memory-usage wins.
The Fx graph snippet that constructs this sparse matrix looks like:
```
full_default_3: "bf16[32768, 50257]" = torch.ops.aten.full.default([32768, 50257], 0, dtype = torch.bfloat16, layout = torch.strided, device = device(type='cuda', index=0), pin_memory = False)
scatter: "bf16[32768, 50257]" = torch.ops.aten.scatter.value(full_default_3, 1, where_2, -1.0); full_default_3 = where_2 = None
```
Leveraging the following observations:
- the scatter is applied upon an all-zero (or, more generally, a constant) tensor
- the index tensor for the scatter has a single element on the scatter dimension; in this case it's the label tensor
allows us to lower this 'scatter_upon_const_tensor' pattern to a pointwise kernel that can be easily fused with downstream kernels:
```
def inner_fn(idx):
    selector_idx = list(idx)
    selector_idx[dim] = 0  # can do this since the index tensor has a single element on the scatter dimension
    selector = selector_loader(selector_idx)
    return ops.where(
        selector == ops.index_expr(idx[dim], torch.int64),
        ops.constant(val, dtype),
        ops.constant(background_val, dtype),
    )
```
## Test result on microbenchmark
For the microbenchmark added as `test_cross_entropy_loss`, we improve latency from 47.340ms to 42.768ms, memory footprint from 10.524GB to 7.227GB on A100. (on H100, we improve latency from 27.54ms to 23.51ms, memory footprint from 10.574GB to 7.354GB).
The saving matches the back-of-envelope calculation. We avoid storing a BF16 tensor with shape [30K, 50K], which is about 3GB in size. On A100, avoiding loading and storing such a tensor saves roughly 3GB x 2 / 1.5TB/s = 4ms.
## Test result on llm.c
We also test this on llm.c and the saving is much larger especially for memory footprint. The reason is due to autotuning that allocates extra memory for benchmarking. (Check https://github.com/pytorch/pytorch/issues/129258 and https://github.com/pytorch/pytorch/pull/129399 for more details).
For llm.c PyTorch implementation on A100, we improve from
171K tokens/s , 33.6G peak memory usage to
180K tokens/s, 18.6G peak memory usage. (A **45%** saving of peak memory)
## Test on PyTorch 2.0 Dashboard
The optimization is quite general especially for transformers. We tested this on PyTorch2.0 dashboard. Here is the [result](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2017%20Jun%202024%2018%3A07%3A51%20GMT&stopTime=Mon%2C%2024%20Jun%202024%2018%3A07%3A51%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/shunting314/158/head&lCommit=c62c55e29c65497d495217b6574bb36b0c4da7d4&rBranch=main&rCommit=0d25f096c1beaf8749932a3d6083ad653405ed71).
TLDR, for Huggingface benchmark suite, we get **6%** geomean perf improvement and **10%** geomean memory footprint improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129043
Approved by: https://github.com/jansel, https://github.com/Chillee
If users pass `device_id` to init_process_group, they enable eager init
for the default group. Then if they subsequently call `new_group`, the
device_id argument is not required as it should be assumed to match the
one used for init_process_group.
However, both `init_process_group` and `new_group` apis share a helper
function, which expects a `device_id` value that defaults to None. When
it's None, eager initialization is disabled.
This PR ensures that if a device_id was passed to init_process_group,
the same device_id will automatically be fed into the helper function
for any new_group calls that follow.
**Test plan**
I found an existing test in CI `test_comm_split_subgroup` that failed after my change, because it was asserting that backend comm_split counter did not increment eagerly, and its behavior had changed to increment eagerly. I updated the test in the PR to pass with my change.
I also tested locally via simple program with TORCH_CPP_LOG_LEVEL=INFO and
observed eager initialization of the 'lows' and 'highs' PGs before the
'Here' print.
```
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl", device_id=torch.device(f"cuda:{torch.distributed.get_node_local_rank(0)}"))
dist.new_group([0, 1], group_desc="lows")
dist.new_group([2, 3], group_desc="highs")
print("Here")
torch.distributed.destroy_process_group()
```
Output:
https://gist.github.com/wconstab/88a5ba0b970244ca1f79133f989e0349
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129284
Approved by: https://github.com/pavanbalaji, https://github.com/fduwjj, https://github.com/d4l3k, https://github.com/nvcastet
This PR adds an alternative backend for Inductor, adding Composable Kernel Universal GEMM instances to the autotune instance selection.
The implementation is heavily influenced by the series of PRs which adds CUTLASS backend (https://github.com/pytorch/pytorch/issues/106991). The main differences are
(1) customizing compiler for the ROCm platform
(2) customizing template code generation for Composable Kernel Universal GEMM instances.
We provide config tuning knobs to balance instance-source compilation time against finding the best instance.
### Testing
Install the ck library
```
pip install git+https://github.com/rocm/composable_kernel@develop
```
Run the test
```
TORCH_LOGS=+torch._inductor \
pytest --capture=tee-sys test/inductor/test_ck_backend.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125453
Approved by: https://github.com/eellison, https://github.com/jansel
#### Issue
In jit.trace, torch.numel() is automatically cast to a `LongTensor`. But during conversion, we lost the cast: `prim::NumToTensor` was previously converted to `torch.ops.aten.scalar_tensor`, which uses the same `dtype` as the input tensor instead of `LongTensor`. In this PR, we add a cast to convert it to the correct `dtype`.
#### Test Plan
We activate previously failing test case.
* `pytest test/export/test_converter.py -s -k test_implicit_constant_to_tensor_handling`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128761
Approved by: https://github.com/angelayi
We've been facing issues where TCPStore can successfully connect but then fail in the validate() function due to resets from listen backlog queue overflow when combined with reset enabled as well as long init times.
This PR does a few things:
* Retry that connect and validate up to the specified timeout.
* Use exponential backoff with jitter for the retry logic instead of a fixed 1s sleep (a Python sketch of the backoff idea follows this list).
* Eliminate the `sleep(std::chrono::milliseconds(numWorkers))` on init which can add significant delays to startup. This is no longer necessary per @XilunWu https://github.com/pytorch/pytorch/pull/116141
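As a sketch of the retry behavior only (the actual implementation is C++ inside TCPStore; the interval, cap, and timeout values below are illustrative assumptions):
```python
import random
import time

def retry_with_backoff(op, timeout_s=30.0, base_s=0.05, cap_s=1.0):
    # Retry `op` until it succeeds or the overall timeout expires, sleeping
    # with capped exponential backoff plus jitter between attempts so that
    # thousands of workers do not reconnect in lockstep.
    deadline = time.monotonic() + timeout_s
    attempt = 0
    while True:
        try:
            return op()
        except ConnectionError:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise
            sleep_s = min(cap_s, base_s * (2 ** attempt)) * random.uniform(0.5, 1.5)
            time.sleep(min(sleep_s, remaining))
            attempt += 1
```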
Test plan:
```
python test/distributed/test_store.py -v
./build/bin/BackoffTest
```
Will do internal testing with some large scale jobs to ensure TCPStore works correctly.
At 4k scale: 4x improvement
```
tristanr@devvm4382 ~/pt_tests [SIGABRT]> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py (pytorch-3.10)
started 0
init 0
set 0
joined all
________________________________________________________
Executed in 1.98 secs fish external
usr time 0.93 secs 91.00 micros 0.93 secs
sys time 1.98 secs 954.00 micros 1.97 secs
tristanr@devvm4382 ~/pt_tests> conda activate torchdrive-3.10 (pytorch-3.10)
tristanr@devvm4382 ~/pt_tests> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py (torchdrive-3.10)
started 0
init 0
set 0
joined all
________________________________________________________
Executed in 8.20 secs fish external
usr time 2.15 secs 0.00 micros 2.15 secs
sys time 2.76 secs 843.00 micros 2.76 secs
```
```py
import time
import os
import threading
from multiprocessing import Pool

WORLD_SIZE = 10000

import torch.distributed as dist

def run(rank):
    should_log = rank % (WORLD_SIZE // 10) == 0
    if should_log:
        print(f"started {rank}")
    store = dist.TCPStore(
        host_name="devvm4382.nao0.facebook.com",
        port=29500,
        world_size=WORLD_SIZE,
        is_master=rank == 0,
        use_libuv=True,
    )
    if should_log:
        print(f"init {rank}")
    store.set(f"key{rank}", "1234")
    if should_log:
        print(f"set {rank}")
    del store

def noop(rank):
    pass

print("starting pool")
with Pool(WORLD_SIZE) as pool:
    pool.map(noop, range(WORLD_SIZE), 1)
    print("pool hot")
    start = time.time()
    pool.map(run, range(WORLD_SIZE), 1)
    print("run finished", time.time() - start)
```
```
tristanr@devvm4382 ~/pt_tests> python tcpstore_large_test.py (pytorch-3.10)
starting pool
pool hot
started 0
[W624 16:58:09.086081750 TCPStore.cpp:343] [c10d] Starting store with 10000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it.
started 1000
init 1000
set 1000
started 2000
init 2000
set 2000
started 3000
init 3000
set 3000
started 4000
init 4000
set 4000
started 5000
init 5000
set 5000
started 6000
init 6000
set 6000
started 7000
init 7000
set 7000
started 8000
init 8000
set 8000
started 9000
init 9000
set 9000
init 0
set 0
run finished 0.705092191696167
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129261
Approved by: https://github.com/rsdcastro, https://github.com/wconstab, https://github.com/kurman, https://github.com/XilunWu, https://github.com/c-p-i-o
FSDP2 accumulates gradients for sharded parameters outside of the autograd engine's normal accumulation logic. We can respect registered post-accumulate-grad hooks by running them manually.
**Discussion**
Discussing with @soulitzer, changing FSDP2 to make the sharded parameters autograd leaves requires nontrivial changes to FSDP and some changes to the autograd engine (around forward vs. backward streams) where the changes may not preserve eager-mode performance and/or add some complexity.
Under the FSDP2 design, the sharded parameters never participate in autograd, so calling `register_post_accumulate_grad_hook` on them would otherwise be a no-op. In other words, there is virtually no chance for FSDP2 incorrectly re-running the hook when it should not.
Given these, a reasonable near-term solution is for FSDP2 to run the post-accumulate-grad hooks manually.
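For reference, a minimal sketch of the hook contract being honored (a plain `nn.Linear` here rather than FSDP; the optimizer choice and hyperparameters are illustrative):
```python
import torch
import torch.nn as nn

# Run an optimizer step per parameter as soon as its gradient has been
# accumulated, i.e. "optimizer in backward".
model = nn.Linear(16, 16)
optim_per_param = {
    p: torch.optim.AdamW([p], lr=1e-3, foreach=False) for p in model.parameters()
}

def optimizer_hook(param: torch.Tensor) -> None:
    optim_per_param[param].step()
    optim_per_param[param].zero_grad()

for p in model.parameters():
    p.register_post_accumulate_grad_hook(optimizer_hook)

loss = model(torch.randn(4, 16)).sum()
loss.backward()  # hooks fire as each .grad is accumulated; no separate optim.step()
```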
**Caveats**
- Running `foreach=False` optimizer _per parameter tensor_ incurs significantly higher CPU overhead compared to `foreach=True` (partially due to `DTensor` being a `__torch_dispatch__` tensor subclass).
- On preliminary benchmarking on Llama3-8B on 8 GPUs, this CPU overhead is mostly tolerable, but on smaller # of GPUs or a less compute-intensive model, this may not be.
- One solution for native Adam/AdamW is to use `fused=True`, which makes both the CPU overhead lower and GPU compute faster. However, this is generally not an option for user-defined optimizers.
- If this CPU overhead blocks adoption of this feature, then we should seriously consider an FSDP-specific API like `register_post_backward_hook(params: List[nn.Parameter]) -> None` that allows the user to see all parameters in the `FSDPParamGroup` together for the hook so that the user can still run a `foreach=True` optimizer step on that `List[nn.Parameter]`.
- The post-accumulate-grad hook runs in the reduce-scatter stream. Our current stream handling logic does not have the default stream wait for the reduce-scatter stream until the end of backward. Unless we add that, we cannot simply run the post-accumulate-grad hook in the default stream.
- This means that optimizer compute will overlap with backward compute, which may slowdown end-to-end execution slightly (e.g. due to SM contention or wave quantization effects). For example, on Llama3-8B, we see about ~3% decrease in MFU when running optimizer in backward even though the optimizer steps are fully overlapped and there are no CPU boundedness issues.
- This PR's goal is only to run the hook manually. State dict etc. for optimizer-in-backward is out of scope.
**Experiments (torchtitan)**
- Llama3-8B on 2 GPUs, local batch size 1, with full activation checkpointing, and bf16/fp32 mixed precision:
- Without optimizer-in-backward: 82.03 GiB reserved memory; 28.1% MFU
- With optimizer-in-backward (`foreach=False`): 72.84 GiB reserved memory; 28.9% MFU (speedup from more of optimizer step overlapped)
- With optimizer-in-backward (`fused=True`): 70.84 GiB reserved memory; 30.4% MFU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129450
Approved by: https://github.com/weifengpy
**Performance mode issue**: When a dynamo benchmark's performance warm-up fails, the result is not written into the csv file, yet the accuracy result is still written as `fail_to_run` even when the dynamo pass fails. As a result, the number of models in the accuracy csv does not match the number in the performance csv.

- **Fix**: Models that fail during warm-up are now recorded into the csv file, as shown below:

**Accuracy mode issue**: `detectron2_fasterrcnn_r` models failed in accuracy mode but ran successfully in performance mode. The accuracy failure is the same as in PR ee557d8f61.
```
Dynamic Shape:
Traceback (most recent call last):
File "benchmarks/dynamo/torchbench.py", line 449, in <module>
torchbench_main()
File "benchmarks/dynamo/torchbench.py", line 445, in torchbench_main
main(TorchBenchmarkRunner(), original_dir)
File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3650, in main
process_entry(0, runner, original_dir, args)
File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3582, in process_entry
return run(runner, args, original_dir)
File "/workspace/pytorch/benchmarks/dynamo/common.py", line 4163, in run
assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 4
```

- **Fix**: same as PR ee557d8f61, we skip setting batch_size to 4 when testing dynamic shapes.
Dynamic shapes passrate improved from 89% -> **95%**
| Comp Item | Compiler | suite | before | After fix |
|-----------|----------|------------|------------|------------|
| Pass Rate | Inductor | torchbench | 89%, 73/82 | 95%, 79/83 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126764
Approved by: https://github.com/jansel
The current logic to set the HAS_SBGEMM flag is skipped when the BLAS libraries have already been found, i.e., when set from the environment variable BLAS=OpenBLAS. If BLAS_LIBRARIES is already set, the code that checks whether the BLAS library provides sbgemm is never executed. This commit moves that logic outside the conditional so it always runs.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125227
Approved by: https://github.com/malfet
MacOS uses case-insensitive filesystem by default, but it's better to specify include path using proper capitalization
Should fix
```
MultiTensorApply.h:4:10: warning: non-portable path to file '<ATen/native/mps/operations/FusedOptimizerOps.h>'; specified path differs in case from file name on disk [-Wnonportable-include-path]
#include <Aten/native/mps/operations/FusedOptimizerOps.h>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129474
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/qqaatw
Fixes #125745
Bug source: When addition requires broadcasting, adding complex numbers is not implemented correctly in `torch/_inductor/decomposition.py` because `x.view(x.real.dtype)` would multiply the last dimension by 2, and then broadcasting wouldn't work.
Fix: re-shape the complex tensors after view and before broadcasting.
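A minimal sketch of the pitfall and the reshape-based fix (an eager-mode illustration, not the Inductor decomposition itself):
```python
import torch

x = torch.randn(4, 1, dtype=torch.complex64)
y = torch.randn(1, 3, dtype=torch.complex64)

xr = x.view(torch.float32)  # shape (4, 2): viewing as the real dtype doubles the last dim
yr = y.view(torch.float32)  # shape (1, 6): (4, 2) + (1, 6) would fail to broadcast

# Reshaping to (..., 2) after the view restores broadcastability.
out = torch.view_as_complex(xr.reshape(4, 1, 2) + yr.reshape(1, 3, 2))
torch.testing.assert_close(out, x + y)
```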
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129044
Approved by: https://github.com/zou3519, https://github.com/lezcano
For #125323
* Fixes typing for python < 3.10
* Fixes#129390
For #124688
* Improved attribution by registering `register_hook` and `post_accumulate_grad_hook` on params.
* Fixed premature per-module bw peak state initialization for AC.
* This improves per-module stats, global `peak_mem` was already accurate and remains unaffected.
For #128508
* When AC is applied to a `mod (nn.Module)`, the backward order of execution is `pre-bw -> pre-fw -> post-fw -> post-bw`. Since `ModTracker` maintains the `parents` attribute as a set, the `post-fw` hook that runs during backward was prematurely removing the module from `parents`.
* With the fix we now maintain a per-module counter and only remove a module from `parents` when its counter goes to 0.
* Added tests to ensure this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129400
Approved by: https://github.com/awgu, https://github.com/huydhn
Changes by apply order:
1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.
`.parent{...}.absolute()` -> `.absolute().parent{...}`
4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)
`.parent.parent.parent.parent` -> `.parents[3]`
5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~
~`.parents[3]` -> `.parents[4 - 1]`~
6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
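An illustrative before/after for rules 3 and 4 (the path chain is hypothetical):
```python
from pathlib import Path

# Before: chain .parent on the path and absolutize at the end.
repo_root_before = Path(__file__).parent.parent.parent.parent.absolute()

# After: absolutize first (rule 3), then index into parents (rule 4).
repo_root_after = Path(__file__).absolute().parents[3]
```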
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
On Jetson IGX, `python test/test_cuda.py -k test_graph_capture_oom` fails with the following error:
```
RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":841, please report a bug to PyTorch.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
yield
File "/usr/lib/python3.10/unittest/case.py", line 591, in run
self._callTestMethod(testMethod)
File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
method()
File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper
method(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper
method(*args, **kwargs)
File "/opt/pytorch/pytorch/test/test_cuda.py", line 2255, in test_graph_capture_oom
with self.assertRaisesRegex(RuntimeError, oom_regex):
File "/usr/lib/python3.10/unittest/case.py", line 239, in __exit__
self._raiseFailure('"{}" does not match "{}"'.format(
File "/usr/lib/python3.10/unittest/case.py", line 163, in _raiseFailure
raise self.test_case.failureException(msg)
AssertionError: "out of memory" does not match "NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":841, please report a bug to PyTorch. "
```
This is a known issue as nvml support on Jetson is limited, and the OOM reporting in CUDACachingAllocator.cpp requires nvml to be properly loaded, which fails on Jetson.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128661
Approved by: https://github.com/eqy, https://github.com/atalman
cuDNN v8.x added a graph-capturable CTCLoss, which slots "neatly" into the `Tensor` variant
~~WIP as cuDNN has a restriction on the max target length (255), but this is not checkable in the graph-capture case, so the UX around warnings/error-messages here might need to be tuned...~~
Currently checks restriction on max target length during warmup run(s), and bails out during capture if this constraint was violated during warmup.
CC @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128271
Approved by: https://github.com/ezyang, https://github.com/malfet
Since we use [`DEFAULT_PROTOCOL=2`](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L62), some functions/classes that were renamed from python 2-->3 will be pickled with their python2 name. This PR ensures that when a mod `GLOBAL <python2_mod>.<python2_name> ` is encountered, [following the strategy used by pickle](https://github.com/python/cpython/blob/main/Lib/pickle.py#L1590C13-L1593C63) it is properly mapped to `<python3_mod>.<python3_name>`.
This fix ensures that `add_safe_globals` works properly for such functions/classes (i.e. users will allowlist the python3 func and the weights_only unpickler will do the appropriate translation when checking whether a class was allowlisted).
An example is as follows:
`__builtin__` was named to `builtins`, see the [release notes for Python 3.0](https://docs.python.org/3/whatsnew/3.0.html)
> Renamed module `__builtin__` to [`builtins`](https://docs.python.org/3/library/builtins.html#module-builtins) (removing the underscores, adding an ‘s’). The __builtins__ variable found in most global namespaces is unchanged. To modify a builtin, you should use [builtins](https://docs.python.org/3/library/builtins.html#module-builtins), not `__builtins__`!
However, since we use [`DEFAULT_PROTOCOL=2`](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L62), builtins will be pickled with their module string as `__builtin__`.
```python
>>> import pickle
>>> import pickletools
>>> print.__module__
'builtins'
>>> with open('print.pkl', 'wb') as f:
>>> pickle.dump(print, f, protocol=2) # 2 because this is the default protocol used by pytorch
>>> with open('print.pkl', 'rb') as f:
>>> pickletools.dis(f)
0: \x80 PROTO 2
2: c GLOBAL '__builtin__ print' # pickle saves the module string as __builtin__ !!! :(
21: q BINPUT 0
23: . STOP
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129244
Approved by: https://github.com/albanD
Expect the username in the runner rollover issue (https://github.com/pytorch/test-infra/issues/5132) to be prefixed with a "@".
This will make typos way less likely since github's autocomplete/autoformatting will help out
For now, I've updated the issue to have usernames both with and without the @ while this change rolls out
Testing:
Ran the script locally on both this issue and a new test issue and verified they both had the expected output:
```
(venv) (base) ➜ ~/pytorch git:(zainr/improve-get-workflow-type)
python .github/scripts/get_workflow_type.py --github-token github_pat_*** --github-issue 5132 --github-user ZainRizvi --github-branch "zainr/stuff"
{"label_type": "lf.", "message": "LF Workflows are enabled for ZainRizvi. Using LF runners."}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129246
Approved by: https://github.com/zxiiro, https://github.com/huydhn
# Compile time for eager backend
## AlbertForMaskedLM
No inlining - 3.65 seconds
Inlining on main - 7.48 seconds
Inlining + this PR - 6.70 seconds
## MobileBertForMaskedLM
No inlining - 26.90 seconds
Inlining on main - 48.21 seconds
Inlining + this PR - 43.85 seconds
*Next PR in the stack makes the total compile time better/comparable to no inlining*
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129315
Approved by: https://github.com/jansel
ghstack dependencies: #129316
It's embarrassing that there is a hidden double clone bug in coordinate descent tuning.
In `CachingAutotuner.coordinate_descent_tuning`, we clone mutated args to make sure benchmarking does not cause numerical problems. But later on in `CachingAutotuner.bench` we do that again.
This double clone is fine if
- the tensor is small
- the allocation of the tensor is not on the critical path for memory footprint.
But neither holds for quite common usage of cross entropy loss.
This is related to the memory usage debugging in https://github.com/pytorch/pytorch/pull/129043 . Note that the general issue that peak memory usage increasing due to autotuning still exists. This bug just makes it worse (since we double allocate).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129399
Approved by: https://github.com/Chillee, https://github.com/jansel
Volta (sm_7x) has no hardware support for the bfloat16 datatype, but it is emulated in software, so PyTorch eager can use bfloat16 tensors while Triton cannot. So if a graph with either CUDA bf16 input or output tensors is encountered, raise a warning and skip the frame.
Add an optional parameter `including_emulation` to the `torch.cuda.is_bf16_supported` method and call it from `torch._inductor.compile_fx._check_triton_bf16_support`.
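A hedged sketch of the intended check (the parameter name is taken from the description above; the call site is illustrative):
```python
import torch

def can_compile_bf16_graph() -> bool:
    # Triton needs real hardware bf16 support, so exclude software emulation
    # when deciding whether a graph with CUDA bf16 inputs/outputs can be compiled.
    return torch.cuda.is_available() and torch.cuda.is_bf16_supported(including_emulation=False)

if not can_compile_bf16_graph():
    print("bf16 is only emulated on this GPU; warn and skip the frame")
```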
Test plan: Modify `is_bf16_supported` to return False and see that warning is generated
Fixes https://github.com/pytorch/pytorch/issues/118122 and https://github.com/pytorch/pytorch/issues/118581
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129288
Approved by: https://github.com/eqy, https://github.com/jansel
This PR does two things:
1. It duplicates the fake script object because aot_export traces the program twice; tracing the first time would otherwise make the result of the second trace wrong.
2. It also adds a new test for methods that return constant outputs. Before this PR, there was no meta["val"] for these nodes because fx won't track these constants. We still need to preserve these constant-returning operators in the graph because torchbind objects are stateful, and deleting them would remove the implicit state mutation inside the object.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128844
Approved by: https://github.com/angelayi
Changes:
1. Make some arguments positional-only as we only support Python 3.8+
2. Clean up `torch.typename(obj)` implementation.
3. Update type annotations, especially `is_tensor()` and `is_masked_tensor()`, using `TypeGuard` (a short sketch follows this list).
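A short sketch of what the `TypeGuard`-based annotation buys (the helper name here is hypothetical; the real annotations live on `is_tensor()` / `is_masked_tensor()`):
```python
from typing import Any

import torch
from typing_extensions import TypeGuard

def looks_like_tensor(obj: Any) -> TypeGuard[torch.Tensor]:
    # After a True result, static type checkers narrow `obj` to torch.Tensor.
    return isinstance(obj, torch.Tensor)

def describe(obj: Any) -> None:
    if looks_like_tensor(obj):
        print(obj.shape, obj.dtype)  # checker knows these attributes exist here
    else:
        print(type(obj).__name__)
```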
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129001
Approved by: https://github.com/malfet
Previously, the `FSDPCommContext` only defined the stream attributes when `FSDPCommContext.init` was called from lazy initialization. This means that if the user called `module.unshard()` before lazy init (e.g. before the first forward pass), it would error in `wait_for_unshard()`. This PR fixes that by making sure the stream attributes are defined at construction time, using just the default stream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129241
Approved by: https://github.com/Skylion007, https://github.com/weifengpy
**Summary**
In the int8 GEMM template, we view the input from 3D to 2D and view the output back to 3D for QLinear, which makes the output of this QLinear a `view`. If this output view then feeds into a QLinear-Binary fusion, it breaks the assumption of QLinear-Binary with post-op inplace `sum`. We change the post-op from inplace `sum` to outplace `add` for this case, similar to the FP32/BF16 Linear inplace handling in 1208347d09/torch/_inductor/fx_passes/mkldnn_fusion.py (L541-L543).
**TestPlan**
```
clear && numactl -C 56-111 -m 1 python -u -m pytest -s -v inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_dequant_promotion_cpu_input_dim_exceeds_2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128808
Approved by: https://github.com/jgong5
ghstack dependencies: #128804
This is a copy of Brian's PR https://github.com/pytorch/pytorch/pull/128754, with some changes in the test_distributed_patterns.py unit tests to more closely reflect FSDP2 patterns. Also disabled two tests `test_input_mutation_storage_resize_up_down` and `test_input_mutation_storage_resize_not_supported` in test_aotdispatch.py until we figure out the right behavior for them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129203
Approved by: https://github.com/bdhirsh
* __->__ #125323
### Why do we need the FSDP Memory Tracker?
**Tuning Decisions**
1. What is the expected peak memory with current configuration?
2. If I change my FSDP wrapping, how much effect will it have on peak memory?
3. What is the best batch size to use?
4. What is the maximum sequence length that one can run with current configuration?
5. How does increasing/decreasing the “DP” world size affect peak memory?
6. How much memory do I save if I move the optimizer to the CPU?
7. Which activation checkpointing policy should I use?
8. If I have various SAC policies, How do they compare against each other?
9. What happens if I apply different SAC policies to different FSDP units?
10. If I make my gradient reduction in fp32, what effect will it have on memory?
11. If I want to use a custom mixed precision policy, how will it affect the peak memory?
12. When does it make sense to use HSDP?
13. Can I reshard to a smaller mesh without increasing peak memory substantially?
14. Can I safely disable post-forward reshard without causing an OOM?
**Debugging**
1. Which module contributes most to activation memory?
2. Which FSDP unit is holding a lot of unsharded memory?
3. AC is not releasing memory?
The FSDP2 Memory Tracker addresses all of the above. It is based on:
* #124688
* #128508
Example and Output:
```
if __name__ == "__main__":
    from contextlib import nullcontext
    from functools import partial

    import torch
    from torch.distributed._composable import checkpoint
    from torch.distributed._composable.fsdp import (
        CPUOffloadPolicy,
        fully_shard,
        MixedPrecisionPolicy,
    )
    from torch.distributed._tensor import DeviceMesh
    from torch.distributed._tools.fsdp2_mem_tracker import FSDPMemTracker
    from torch._subclasses.fake_tensor import FakeTensorMode
    from torch.testing._internal.distributed._tensor.common_dtensor import (
        ModelArgs,
        Transformer,
        TransformerBlock,
    )
    from torch.testing._internal.distributed.fake_pg import FakeStore

    dev = torch.device("cuda:0")
    torch.cuda.set_device(dev)
    world_size = 4
    store = FakeStore()
    torch.distributed.init_process_group(
        "fake", rank=0, world_size=world_size, store=store
    )
    mesh = DeviceMesh("cuda", torch.arange(0, world_size))
    torch.cuda.empty_cache()
    torch.manual_seed(42)
    use_fake_mode = False
    with FakeTensorMode() if use_fake_mode else nullcontext():
        vocab_size = 8192
        bsz, seq_len = 32, 1024
        with torch.device(dev):
            model_args = ModelArgs(
                n_layers=2,
                n_heads=16,
                vocab_size=vocab_size,
                max_seq_len=seq_len,
                dropout_p=0.1,
            )
            model = Transformer(model_args)
        foreach = True
        mp_policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)
        offload_policy = CPUOffloadPolicy(pin_memory=not use_fake_mode)
        reshard_after_forward = True
        fsdp_config = {}
        fully_shard_fn = partial(
            fully_shard,
            mesh=mesh,
            reshard_after_forward=reshard_after_forward,
            offload_policy=offload_policy,
            mp_policy=mp_policy,
        )
        for module in model.modules():
            if isinstance(module, TransformerBlock):
                checkpoint(module, preserve_rng_state=not use_fake_mode)
                fully_shard_fn(module)
        fully_shard_fn(model)
        optim = torch.optim.Adam(model.parameters(), lr=1e-2, foreach=foreach)

        torch.manual_seed(42)
        inp = torch.randint(0, vocab_size, (bsz, seq_len), device=dev)

        torch.cuda.reset_accumulated_memory_stats()
        torch.cuda.reset_peak_memory_stats()

        fmt = FSDPMemTracker(model, optim)
        fmt.track_inputs((inp,))
        with fmt:
            for iter_idx in range(2):
                loss = model(inp).sum()
                loss.backward()
                optim.step()
                optim.zero_grad()
                if iter_idx == 0:
                    fmt.reset_mod_stats()
        mem_stats = torch.cuda.memory_stats()
        tracker_peak = fmt.get_tracker_snapshot("peak")[dev]["Total"]
        cuda_peak_active = mem_stats["active_bytes.all.peak"]
        fmt.display_modulewise_snapshots(depth=4, units="MiB", tabulate=True)
        fmt.display_snapshot("peak", units="MiB", tabulate=True)
        print(
            f"peak active: {cuda_peak_active / (1024**3)} GiB | "
            f"Tracker Max: {tracker_peak / (1024 ** 3)} GiB"
        )
        if not use_fake_mode:
            print(f"Accuracy: {tracker_peak/cuda_peak_active}")
    try:
        torch.distributed.destroy_process_group()
    except Exception as e:
        print(e)
```
<img width="1236" alt="Screenshot 2024-06-21 at 5 16 49 PM" src="https://github.com/pytorch/pytorch/assets/12934972/9be40b8b-e635-4112-b111-418413e6b959">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125323
Approved by: https://github.com/awgu
This code is unused because we just inline the `.parameters` call. The code was also wrong because side-effects only track the first level of mutations: an object might not be marked mutated if one of its child objects (like a dict) is mutated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129316
Approved by: https://github.com/jansel
This PR:
- moves some of the dtype-string utilities into ScalarType.{h, cpp}
- adds a new utility to get a mapping from dtype name to the C++ dtype
- the parser now checks if the string is a dtype name; if it is, it pulls the C++ dtype from the mapping (a Python sketch of the idea follows this list).
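A hedged Python analogue of the lookup (the real utility lives in C++ in ScalarType.{h, cpp}; the names here are illustrative):
```python
import torch

# Map dtype-name strings ("float32", "bfloat16", ...) to torch dtypes by
# scanning the torch module's attributes once.
DTYPE_BY_NAME = {
    name: value
    for name, value in vars(torch).items()
    if isinstance(value, torch.dtype)
}

def parse_dtype(token: str):
    # Returns the dtype if the token names one, else None so the parser can
    # fall through to its other rules.
    return DTYPE_BY_NAME.get(token)

assert parse_dtype("float32") is torch.float32
assert parse_dtype("not_a_dtype") is None
```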
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129189
Approved by: https://github.com/albanD
ghstack dependencies: #129177, #129178, #129179
Currently if `x` is a CUDA tensor, calling `x.untyped_storage().resize_()` seems to always go into the `built without cuda` branch of `resize_storage_bytes_()` regardless of whether PyTorch is built with CUDA. I suspect this is because `inductor_ops.cpp` is only included in `libtorch_cpu.so` thus doesn't have the `USE_CUDA` information or ability to link to CUDA-related functions.
This PR moves `resize_storage_bytes_()` related custom op functions out of `inductor_ops.cpp` into its standalone file `resize_storage_bytes.cpp` to be included in `libtorch_python.so` instead. This mimics the setup for `StorageMethods.cpp`. This way, `resize_storage_bytes_()` can have access to the CUDA-related functions, which passes the CUDA unit test.
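For context, a minimal illustration of the call path in question (requires a CUDA build and device):
```python
import torch

x = torch.randn(1024, device="cuda")
storage = x.untyped_storage()

storage.resize_(0)                             # free the backing allocation
storage.resize_(x.numel() * x.element_size())  # re-allocate it later
```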
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129215
Approved by: https://github.com/jansel
**Summary**
Currently, the only way for users to view the module tracing table is to print it in the console, which can be hard to read. I have added functionality to comm_debug_mode for a user to log the module tracing table to an output.txt file, giving the user more options for viewing module tracing. I have implemented the use case in the module tracing examples. The expected output for MLPModule tracing is shown below:
<img width="349" alt="Screenshot 2024-06-14 at 10 39 07 AM" src="https://github.com/pytorch/pytorch/assets/50644008/a05288a9-3cdb-483b-8e27-daab50da6251">
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128721
Approved by: https://github.com/tianyu-l, https://github.com/XilunWu
ghstack dependencies: #128720
**Summary**
The previous example file would run all examples at the same time, leading to confusing output as the 4 processes would mix up the order. To fix this, I have added the ability to choose which example to run, making it easier for users to read the output. Due to importing from torch.testing._internal.distributed._tensor.common_dtensor, the argparser from a file in the dependency tree would overwrite the argparser that I attempted to place in the example file. As a result, I created an argparser in a different file and imported it above the previously mentioned import.
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_distributed_sharding_display
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLPStacked_distributed_sharding_display
3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing
4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing
5. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -h
The first four outputs will be the same as the outputs seen in previous PRs. The expected output for help argument is seen below:
<img width="931" alt="Screenshot 2024-06-14 at 10 25 06 AM" src="https://github.com/pytorch/pytorch/assets/50644008/547ca112-1e7a-4769-857a-558292c6fe7b">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128720
Approved by: https://github.com/XilunWu
Significant bytecode generation API change!
The new suggested convention to generating bytecode to call a function is now to wrap instructions that push a callable to the stack with `add_push_null`, then that callable is called with `create_call_function` with `push_null=False` (see diff for examples).
In Python 3.13, NULL is now expected to be pushed after the callable. In <=3.12, the NULL was pushed before the callable. This change abstracts away the exact placement of the NULL, but the developer must be aware that a NULL may be needed when codegen'ing a callable.
This abstraction also reduces the need for the `push_null=True` option in `create_call_function`, which removes the need to rotate a NULL to the right place on the stack with a sequence of `SWAP` instructions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129172
Approved by: https://github.com/jansel
As titled: if `expr1` and `expr2` are ints, we don't need to call `.xreplace` (a small sketch follows the error example below).
See example error:
```
UserError: L['args'][0][0].size()[1] = 35 is not equal to L['args'][0][2].size()[1] = 23
```
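A small sketch of the guard being described (sympy-based, with illustrative names):
```python
import sympy

def maybe_xreplace(expr, mapping):
    # Plain Python ints have no .xreplace, so only substitute on sympy expressions.
    return expr.xreplace(mapping) if isinstance(expr, sympy.Expr) else expr

s0 = sympy.Symbol("s0")
print(maybe_xreplace(s0 + 1, {s0: 34}))  # 35
print(maybe_xreplace(35, {s0: 34}))      # 35, returned unchanged
```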
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129271
Approved by: https://github.com/lezcano
`WeakDep`s force readers to have completed before a mutation overwrites the
buffer, but we want to allow fusions to occur for inplace mutations where the
same index is read and written.
Currently this is achieved by:
1. Identifying the buffers used by the mutating op in its `dep_closure`
2. Not creating `WeakDep`s for buffers in the `dep_closure`
3. Fixing up any bad fusions that might occur by an extra check in `can_fuse_vertical`
So we are first over-aggressive in removing `WeakDep`s, then add an ad-hoc fixup.
This PR instead emits all `WeakDep`s and adds a `fusable_weak_dep` check to
`can_fuse_vertical` which selectively allows inplace operation to fuse.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128979
Approved by: https://github.com/lezcano
ghstack dependencies: #129082, #129083
The nodes are already topologically sorted by this point, so DCEing a chain of
nodes will take one full iteration per node. Simply reversing the iteration
order means all users will be removed before checking a node.
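A minimal sketch of the idea on a `torch.fx.Graph` (simplified; a real pass also has to respect impure ops):
```python
from torch import fx

def reverse_dce(graph: fx.Graph) -> None:
    # graph.nodes is topologically sorted, so walking it in reverse removes
    # each dead user before its producer is inspected; a whole dead chain
    # disappears in a single pass.
    for node in reversed(graph.nodes):
        if node.op not in ("output", "placeholder") and not node.users:
            graph.erase_node(node)
```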
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129082
Approved by: https://github.com/lezcano
Summary:
This forward fixes this diff:
D58699985
Since we have a few things in flight it would be much better to forward fix this test
Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:test_inductor_cuda -- --exact 'caffe2/test/inductor:test_inductor_cuda - test_red_followed_by_transposed_pointwise (caffe2.test.inductor.test_torchinductor.TritonCodeGenTests)'
Differential Revision: D58767577
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129037
Approved by: https://github.com/vkuzo
Summary: Make the recordAnnotations' RecordFunction callback initialize lazily when memory history recording starts. This will help reduce the impact on the Time To First Batch metric.
Test Plan: CI and ran locally.
Differential Revision: D58875576
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129242
Approved by: https://github.com/zdevito
This diff introduces AOTAutogradTestWithCache, which runs AOTAutogradTests with both dynamo and AOTAutogradCache.
To do this, for any verify_aot_autograd() calls in the original tests, we run compiled_f an extra time. We also turn on a new strict mode that throws any time a cache is missed due to weird reasons, like BypassAOTAutogradCache or FxGraphCacheMiss.
We use a mocked version of FXGraphCache to decrease the number of variables for these tests. The normal tests in test_aot_autograd_cache.py will still run with FXGraphCache. I might change my mind and unmock these in the future.
In total, 87 of the tests pass naturally. None of the tests fail in non strict cache mode, so the cache never crashes, it just misses more often than we'd like. The remaining 27 tests fail due to relatively simple (though not necessarily easy to fix) reasons. I'll fix the remaining test failures in the next few PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128222
Approved by: https://github.com/bdhirsh
Summary:
Add '`TORCH_LOGS=+fsdp`' in the CLI to print fsdp logs
Example:
`TORCH_LOGS=+fsdp torchrun --standalone --nproc_per_node=2 run_fsdp.py`
Description:
Add logging to `FSDPParamGroup.pre_forward`, `FSDPParamGroup.post_forward`, `FSDPParamGroup.pre_backward`, and `FSDPParamGroup.post_backward`, `FSDPState._root_pre_forward` if is the root, and `FSDPState._root_post_backward_final_callback`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128663
Approved by: https://github.com/weifengpy, https://github.com/awgu
cuDNN v8.x added a graph-capturable CTCLoss, which slots "neatly" into the `Tensor` variant
~~WIP as cuDNN has a restriction on the max target length (255), but this is not checkable in the graph-capture case, so the UX around warnings/error-messages here might need to be tuned...~~
Currently checks restriction on max target length during warmup run(s), and bails out during capture if this constraint was violated during warmup.
CC @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128271
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/pytorch/issues/128611
We detach using tensor_data, which already preserves the version counter, so there is no reason to save it prior to unpacking:
```
at::TensorBase VariableHooks::tensor_data(const at::TensorBase& self) const {
TORCH_CHECK(self.defined(), "cannot call tensor_data() on undefined tensor");
auto self_impl_copy = self.unsafeGetTensorImpl()->shallow_copy_and_detach(
/*version_counter=*/self.unsafeGetTensorImpl()->version_counter(),
/*allow_tensor_metadata_change=*/
self.unsafeGetTensorImpl()->allow_tensor_metadata_change());
return at::Tensor(self_impl_copy);
}
```
This changes the behavior when hooks are involved:
- Previously, if you had a hook that replaced the saved tensor with an entirely new tensor, we would've smashed the saved version counter onto that during unpack, which is not quite correct because the tensor returned by user's pack hook is not necessarily aliased to the tensor originally being saved (unlikely), and even if it were, the version counter would already be shared, if the user did their operations not in inference mode (unlikely).
- In this PR, we restore the version counter using the one from the unpack hook's output (see the sketch below).
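A minimal sketch of the scenario in the first bullet, where a pack hook returns an entirely new tensor (the hook bodies are illustrative):
```python
import torch

def pack(t):
    return t.detach().clone()  # a brand-new tensor, not aliased to the saved one

def unpack(t):
    return t

x = torch.randn(4, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    y = (x * x).sum()
# With this change, the version counter observed at unpack time is the one
# carried by the hook's output rather than the counter saved from the original.
y.backward()
```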
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128545
Approved by: https://github.com/albanD
ghstack dependencies: #125795
Summary: We need to redefine RE_PYTORCH_PREPROCESSOR here because hipify_torch applies a positive lookbehind (?<=\W) and lookahead (?=\W) to the pattern to avoid matching a keyword at the beginning or end of a code line. However, keywords can appear in exactly those positions in codegen'd code, which causes the pattern not to match.
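An illustration of the lookaround issue (the keyword used here is just an example hipify target):
```python
import re

wrapped = re.compile(r"(?<=\W)cudaStream_t(?=\W)")

print(bool(wrapped.search("  cudaStream_t stream;")))  # True: surrounded by non-word chars
print(bool(wrapped.search("cudaStream_t stream;")))    # False: nothing precedes the keyword
```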
Test Plan:
```
buck2 run //caffe2/test/inductor:test_cpp_wrapper_hipify
```
```
File changed: fbcode//caffe2/test/inductor/test_cpp_wrapper_hipify.py
Buck UI: https://www.internalfb.com/buck2/395155fa-b2dc-4892-8c71-74e52c65fa2f
Note: Using experimental modern dice
Network: Up: 0B Down: 0B (reSessionID-8fcfc520-755c-48f9-bacc-507c62f59231)
Jobs completed: 10947. Time elapsed: 0.5s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
BUILD SUCCEEDED
/data/users/zhuoran/fbsource/buck-out/v2/gen/fbcode/15b7034708b669be/caffe2/test/inductor/__test_cpp_wrapper_hipify__/test_cpp_wrapper_hipify#link-tree/torch/_utils_internal.py:282: NCCL_DEBUG env var is set to None
/data/users/zhuoran/fbsource/buck-out/v2/gen/fbcode/15b7034708b669be/caffe2/test/inductor/__test_cpp_wrapper_hipify__/test_cpp_wrapper_hipify#link-tree/torch/_utils_internal.py:300: NCCL_DEBUG is forced to WARN from None
test_hipify_aoti_driver_header (caffe2.test.inductor.test_cpp_wrapper_hipify.TestCppWrapperHipify) ... ok
test_hipify_basic_declaration (caffe2.test.inductor.test_cpp_wrapper_hipify.TestCppWrapperHipify) ... ok
test_hipify_cross_platform (caffe2.test.inductor.test_cpp_wrapper_hipify.TestCppWrapperHipify) ... ok
----------------------------------------------------------------------
Ran 3 tests in 0.262s
OK
```
e2e test:
```
TORCH_LOGS="output_code,graph_code" buck2 run mode/{opt,amd-gpu,inplace} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true //aiplatform/modelstore/model_generation/gpu_lowering_service:gpu_lowering_cli -- --model_input_path="ads_storage_fblearner/tree/user/facebook/fblearner/predictor/936383960/0/gpu_lowering/input.merge" --model_output_path="ads_storage_fblearner/tree/user/facebook/fblearner/predictor/936383960/0/gpu_lowering/mi300_inductor_output.merge" --lowering_backend AOT_INDUCTOR --is_ads_model False --aot_inductor_lowering_settings_json='{"use_scripting":true,"preset_lowerer":"standalone_hstu_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change","precision":4,"output_precision":4, "remove_unexpected_type_cast":false, "sample_input_tile_factor":32}' 2>&1 | tee local_benchmark_log.txt
```
Differential Revision: D58705216
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128912
Approved by: https://github.com/desertfire
Summary: Currently AOTI does a two-pass compilation for the CUDA backend. In the first pass AOTI generates Python code, runs the generated code once with real example inputs to trigger Triton kernel compilation and tuning, and then AOTI runs the second pass to generate cpp code and compiles that into a shared library.
There are several problems with this approach when we want to enable the cpp wrapper mode for JIT Inductor:
* Compilation time: JIT compilation is more sensitive to compilation time than AOT compilation. The two-pass approach does add extra overhead for compilation.
* Peak memory size: when executing the first-pass generated code with real inputs, some inputs need to be cloned to avoid side effect coming from input mutation. This can raise the high-water mark for memory consumption.
* Missing triton kernel autotuning: Because kernel autotune depends on the kernel being executed in the two-pass approach, some kernels will not be autotuned when a model contains control flow such as torch.if or torch.while.
This PR is the first step towards solving these problems by moving Triton kernel autotuning to the compile time and use random inputs for tuning. The cpp wrapper codegen still has two passes, but in the first pass, Inductor will generate a separate code just for kernel autotuning, with https://gist.github.com/desertfire/606dc772b3e989b5e2edc66d76593070 as an example, and we no longer need to execute the model after the first-pass finishes. After that we rerun a second pass to generate cpp code. This reduces peak memory consumption and enables kernel autotuning when there is control flow. Truly making the codegen into one-pass will come later once this solution is proven stable and generates as performant kernels as before.
Differential Revision: [D58782766](https://our.internmc.facebook.com/intern/diff/D58782766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129057
Approved by: https://github.com/jansel, https://github.com/eellison
Previously `linear_add_bias` only supported the case where the added tensor is `bfloat16`.
```
class M(torch.nn.Module):
    def __init__(self, dtype):
        super().__init__()
        self.linear1 = torch.nn.Linear(10, 64, bias=False)
        self.bias1 = torch.randn(64).bfloat16()  # if the bias is not bf16, we will crash

    def forward(self, x):
        return self.linear1(x) + self.bias1
```
For `Autocast(bf16)` cases, `self.bias1` will not be converted to bf16. We also did not check the dtypes of weight and bias in the pattern matcher, which leads to an error if the weight is bf16 while the bias is fp32.
We have 2 options to resolve this:
- Check bias/weight dtype, only fold the bias when they are same dtype
- We will fold them even they are not same dtype. By inserting to_dtypes for `bias node` to enforce it have same dtype with weight.
This PR chooses option 1, since implicitly casting the bias to bf16 here would lose precision.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129138
Approved by: https://github.com/jgong5
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.
### SymmetricMemory
`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).
### Python API Example
```python
from torch._C.distributed_c10d import _SymmetricMemory
# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of device identified by a
# ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)
# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)
# Users can write Python custom ops that leverages the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).
# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)
# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)
if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines
    # which bypasses SMs (i.e. no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```
### Custom CUDA Comm Kernels
Given a tensor, users can access the associated `SymmetricMemory` which provides pointer to remote buffers/signal_pads needed for custom communication kernels.
```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```
### Limitations of IntraNodeComm and ProcessGroupCudaP2p
Both `IntraNodeComm` and `ProcessGroupCudaP2p` (which uses it) manage a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Can not avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.
In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.
* __->__ #128582
Differential Revision: [D58849033](https://our.internmc.facebook.com/intern/diff/D58849033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
The current implementation of compiled_autograd_enabled_count affects the entire region under the context manager, so if the context manager wraps torch.compile calls unrelated to the backward, they are affected too:
- no lazy compile for the compiled fw
- no aot autograd cache for inference graphs
We instead maintain a flag while executing the compiled backward callable, to isolate the special handling to the compiled backward graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128982
Approved by: https://github.com/jansel
ghstack dependencies: #127960, #128905
**Summary**
The previous stack op strategy was causing the input to be resharded, resulting in a list index out of range error. I delayed the resharding until after the input_specs were created so that the new dimension could be inserted, preventing the error above. I have also run all the other test cases to ensure the changes did not introduce any new bugs.
**Test Plan**
pytest test/distributed/_tensor/test_tensor_ops.py -s -k test_stack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129018
Approved by: https://github.com/XilunWu
https://github.com/pytorch/pytorch/pull/126717 will skip the tests in both ABI compatible and non-ABI compatible mode.
It's not expected to skip them in non-ABI compatible mode since they can actually run successfully in such mode but only have issues in ABI compatible mode.
We leverage the existing `xfail_list` for those that will only fail in ABI compatible mode.
- `test_qlinear_add` is already in the `xfail_list`.
- `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506
Approved by: https://github.com/jgong5, https://github.com/desertfire
We present a utility MemTracker, that tracks the module-wise memory for the code executed under its context. The core features that this tool aims to provide are:
1. Capturing 'snapshots' of memory for each module during its execution. Specifically, at 8 points, during pre-forward, post-forward, pre-backward, 2nd pre-forward (if AC is applied), 2nd post-forward (if AC is applied), post-backward. Also capturing peak memory snapshot during forward and backward.
2. Each such snapshot provides the per device (cpu, cuda etc) memory breakdown in terms of the global parameters, gradients, activations, optimizer states and temporary memory.
3. A summary for each module (that can be analyzed or processed later), in terms of the memory occupied by its own parameters, buffers, inputs and outputs. The remaining components can be derived from these per module attributes and its corresponding captured snapshots.
4. Record the global peak memory consumption per device and their respective breakdowns.
5. Ability to do all of this under the FakeTensorMode so that all these statistics can be obtained without executing code on real data.
6. Ability to register and track modules, optimizers and any other tensors that are created outside the context of MemTracker.
7. Ability to capture a custom memory snapshot at any point during program execution.
8. Utility functions to display all of these statistics in user-friendly and human readable manner.
These features will enable users to anticipate OOMs, debug and pinpoint where majority of memory comes from, experiment with different activation checkpointing policies, batch sizes, mixed precision, model architecture features (ex. number of layers, hidden dimensions, number of attention heads etc.) and inter-device memory movement (ex. CPU off-loading) among others. Basically anything and everything related to device memory.
* __->__ #128508
Example:
```python
import torch
import torchvision.models as models
from torch.distributed._tools.mem_tracker import MemTracker

device, dtype = "cuda", torch.float32
with torch.device(device):
    model = models.resnet18().to(dtype=dtype)
    optim = torch.optim.Adam(model.parameters(), foreach=True)
mem_tracker = MemTracker()
mem_tracker.track_external(model, optim)
with mem_tracker as mt:
    for i in range(2):
        input_batch = torch.randn(256, 3, 224, 224, device=device, dtype=dtype)
        model(input_batch).sum().backward()
        optim.step()
        optim.zero_grad()
        if i == 0:
            # to account for lazy init of optimizer state
            mt.reset_mod_stats()
mt.display_snapshot("peak", units="MiB", tabulate=True)
mt.display_modulewise_snapshots(depth=2, units="MiB", tabulate=True)
# Check for accuracy of peak memory
tracker_max = mt.get_tracker_snapshot('peak')[device]['Total']
cuda_max = torch.cuda.max_memory_allocated()
accuracy = tracker_max / cuda_max
print(f"Tracker Max: {tracker_max}, CUDA Max: {cuda_max}, Accuracy: {accuracy}")
```
Output
<img width="1197" alt="Screenshot 2024-06-15 at 12 10 12 AM" src="https://github.com/pytorch/pytorch/assets/12934972/83e953db-43dc-4094-90eb-9f1d2ca8e758">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124688
Approved by: https://github.com/awgu
In the split build we end up with an incorrect RPATH for `libtorch_python.so`. This PR fixes said RPATH.
What the rpath should look like:
```
sahanp@devgpu086 ~/pytorch ((636de71c…))> objdump -p ~/main_so_files/libtorch_python.so | grep "RPATH" (pytorch-3.10)
RPATH /lib/intel64:/lib/intel64_win:/lib/win-x64:/home/sahanp/pytorch/build/lib:/home/sahanp/.conda/envs/pytorch-3.10/lib:
```
Before
```
sahanp@devgpu086 ~/pytorch ((636de71c…))> objdump -p ~/split_so_files/libtorch_python.so | grep "RPATH" (pytorch-3.10)
RPATH /home/sahanp/pytorch/torch/lib:/home/sahanp/pytorch/build/lib:
```
After
```
sahanp@devgpu086 ~/pytorch ((636de71c…))> objdump -p build/lib/libtorch_python.so | grep "RPATH" (pytorch-3.10)
RPATH /lib/intel64:/lib/intel64_win:/lib/win-x64:/home/sahanp/pytorch/build/lib:/home/sahanp/pytorch/torch/lib:/home/sahanp/.conda/envs/pytorch-3.10/lib:
```
Testing that this works is in the above PR. Similarly, after running ciflow/binaries the output of objdump -p should not change https://www.diffchecker.com/14PRmCNz/ (checked manywheel py 3.10 cuda 12.1)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129088
Approved by: https://github.com/malfet
Summary:
Same as D57688538, recreated because of GH issues
This diff introduces LocalShardsWrapper which is crucial to migrating from using ShardedTensor to DTensor in TRec state dict representation. As well as any changes needed in PT-D and ModelStore to support this.
It allows us to extend DTensor to support multiple shards on a rank as well as empty shards on a rank as needed by TRec sharding logic.
This diff also extends the support for LocalShardsWrapper to be used in conjunction with DTensor in checkpointing cases (ModelStore and DCP)
See D54375878 for how it is used.
**LocalShardsWrapper supports the following torch ops:**
+ torch.ops._c10d_functional.all_gather_into_tensor.default
+ aten._to_copy.default
+ aten.view.default
+ aten.equal.default
+ aten.detach.default
With extensibility to add more as required by use cases.
See https://docs.google.com/document/d/16Ptl50mGFJW2cljdF2HQ6FwsiA0scwbAbjx_4dhabJw/edit?usp=drivesdk for more info regarding design and approach.
NOTE: This version of LocalShardsWrapper does not support empty shards, that is added in the next diff enabling CW. D57063512
Test Plan:
` buck test mode/opt -c python.package_style=inplace aiplatform/modelstore/client/tests_gpu:dist_checkpoint_save_load_with_stateful_tests -- --print-passing-details`
`buck2 test 'fbcode//mode/dev-nosan' fbcode//torchrec/distributed/tests:test_tensor_configs -- --print-passing-details`
Sandcastle
Reviewed By: XilunWu, wanchaol
Differential Revision: D58570479
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129150
Approved by: https://github.com/XilunWu
Summary:
Export, through AOTAutograd, [deduplicates](11ff5345d2/torch/fx/experimental/proxy_tensor.py (L198)) sym_size calls, which can cause issues during unflattening when the sym_size node is used in multiple submodules.
If preserve_call_module_signature is set, these nodes can't be passed between submodules as placeholders, so the calls (and any downstream un-duplicated nodes) must be copied. Adding this to unflattener
Test Plan: export unflatten test case
Reviewed By: TroyGarden, angelayi
Differential Revision: D58697231
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129153
Approved by: https://github.com/angelayi
Idea: close over min / max sequence length in the main NJT view func (`_nested_view_from_jagged`) so that view replay during fake-ification propagates these correctly in torch.compile.
For dynamic shapes support for min / max sequence length, this PR uses a hack that stores the values in `(val, 0)` shaped tensors.
**NB: This PR changes SDPA to operate on real views instead of using `buffer_from_jagged()` / `ViewNestedFromBuffer`, which may impact the internal FIRST model. That is, it undoes the partial revert from #123215 alongside a fix to the problem that required the partial revert. We need to verify that there are no regressions there before landing.**
Differential Revision: [D55448636](https://our.internmc.facebook.com/intern/diff/D55448636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122836
Approved by: https://github.com/soulitzer
So how come this PR fixes any flakiness?
Well, following my investigation (read pt 1 in the linked ghstack PR below), I had realized that this test only consistently errors after another test was found flaky.
Why? Because TORCH_SHOW_CPP_STACKTRACES=1 gets turned on for _every_ test after _any_ test reruns, following this PR https://github.com/pytorch/pytorch/pull/119408. And yea, this test checked for exact error message matching, which no longer would match since the stacktrace for a foreach function is obviously going to be different from a nonforeach.
So we improve the test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129003
Approved by: https://github.com/soulitzer
Summary:
(1) Make code work when a first layer does not have a bias.
(2) Make it possible to provide both modules and module names as input
(3) Allow sequences of contiguous layers as input, that then get split into pairs
(4) fix documentation to be more clear on inputs to be provided
Test Plan:
Run this new version of the algorithm on a network and see if it throws errors.
There's also this notebook to run and test N5199827
It you tell me where I can find the tests for this code, I can add some simple unit tests as well.
Differential Revision: D55895862
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124632
Approved by: https://github.com/jerryzh168
When caching is enabled, an internal model fails with
```
assert_size_stride(bmm_9, (17, s0, 512), (54784, 512, 1))
AssertionError: expected size 17==17, stride 57344==54784 at dim=0
```
looking at this model, the exact problem is when the cache is hit on the forward graph, the generated code for backward fails since the strides of the outputs of forward, passed to backward as inputs, are not what we expected.
This PR changes the evaluation logic so that we defer evaluation of output stride exprs to load path as opposed to eagerly doing it on save path.
I have not been able to come up with a unit test repro for this problem.
Differential Revision: [D58796503](https://our.internmc.facebook.com/intern/diff/D58796503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128997
Approved by: https://github.com/ezyang
### 🤖 Generated by Copilot at d75cde1
Added MPS support and autograd formulas for LU factorization of tensors. Implemented the `linalg_lu_factor` and `linalg_lu_factor.out` functions for the MPS backend in `LinearAlgebra.mm` and added tests in `test_mps.py`. Added the corresponding dispatch entries in `native_functions.yaml` and the backward and forward formulas in `derivatives.yaml`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99269
Approved by: https://github.com/kulinseth, https://github.com/lezcano
Fix erfinv codegen when ISA could not be detected
Manual test plan (on MacOS):
- Modify `valid_vec_isa_list` to return empty list
- Run `python3 inductor/test_torchinductor_opinfo.py -v -k test_comprehensive_erfinv_cpu_bool`
Before this change, the above-mentioned test fails with
```
Output:
/var/folders/rk/fxg20zvx6vvb5bk7cplq4xrc0000gn/T/tmpgic60b6c/ns/cnsp7snp7fyclkm5lsfiyiv3m6c3svevkbhcb3v7pijdfjwlyaij.cpp:11:25: error: use of undeclared identifier 'calc_erfinv'
auto tmp2 = calc_erfinv(tmp1);
^
1 error generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129090
Approved by: https://github.com/jgong5, https://github.com/jansel
Summary: Found during testing with remote caching: Use the same output logger object between graph.py and codecache.py since it's patched in `run_and_get_cpp_code`. That allows us to capture any logging produced from the codecache path when using `run_and_get_cpp_code`. I'm also fixing a few tests that were passing mistakenly because logging was missing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128794
Approved by: https://github.com/oulgen, https://github.com/leslie-fang-intel
Move operators from member functions to free functions. This is needed to fix torch inductor on s390x.
This change fixes tests like
DynamicShapesMiscTests::test_numpy_min_dynamic_shapes from test/dynamo/test_dynamic_shapes.py
This change also fixes a recently introduced build failure on s390x.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129066
Approved by: https://github.com/malfet
Summary:
Add new traceEvents into Memory Snapshot for record_function annotations. These will capture both the profiler's step annotation as well as user annotations.
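As a hedged sketch of how such annotations end up in a snapshot (requires a CUDA device; the annotation string and output path are illustrative):
```python
import torch
from torch.profiler import record_function

# Start recording allocator history, then run some work under a user annotation.
torch.cuda.memory._record_memory_history()
with record_function("## forward block ##"):
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x
# The dumped snapshot should now contain trace events for the annotation above.
torch.cuda.memory._dump_snapshot("snapshot.pickle")
```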
Test Plan:
CI
Pulled By: aaronenyeshi
Differential Revision: D55941362
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129072
Approved by: https://github.com/zdevito
Summary:
WARNING: This API is highly unstable and will be subject to change in the future.
Add a prototype to "decompose" an ExportedProgram into a joint graph form, so that we can compute the gradients on this graph.
Test Plan: buck test mode/opt caffe2/torch/fb/export:test_experimental
Differential Revision: D55657917
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128847
Approved by: https://github.com/tugsbayasgalan
I'd like to discuss the criteria for regarding an implementation as stable. If there is no existing standard, my initial proposal would be a 6-month period after the commit to regard it as stable. As a result, Adam and AdamW on CUDA would now be considered stable, while the rest are beta.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129006
Approved by: https://github.com/malfet
This PR builds the split build in the pull workflow and runs the appropriate tests against them. A single linux cpu and single gpu build were chosen arbitrarily to not add too many tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126813
Approved by: https://github.com/atalman
ghstack dependencies: #127934
This PR removes the second separate package we were using for the libtorch wheel.
In terms of testing that this works we will look use the PRs above this in the stack.
As for sanity checking these are the wheels that are produced by running
```
python setup.py clean && BUILD_LIBTORCH_WHL=1 with-proxy python setup.py bdist_wheel && BUILD_PYTHON_ONLY=1 with-proxy python setup.py bdist_wheel --cmake
```
```
sahanp@devgpu086 ~/pytorch ((5f15e171…))> ls -al dist/ (pytorch-3.10)
total 677236
drwxr-xr-x 1 sahanp users 188 Jun 4 12:19 ./
drwxr-xr-x 1 sahanp users 1696 Jun 4 12:59 ../
-rw-r--r-- 1 sahanp users 81405742 Jun 4 12:19 torch-2.4.0a0+gitca0a73c-cp310-cp310-linux_x86_64.whl
-rw-r--r-- 1 sahanp users 612076919 Jun 4 12:19 libtorch-2.4.0a0+gitca0a73c-py3-none-any.whl
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127934
Approved by: https://github.com/atalman
Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed and split into NCCLUtils files.
In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969
Approved by: https://github.com/c-p-i-o
Summary:
The default device for the `is_pinned` function is CUDA. This can unnecessarily create a CUDA context for CPU tensors when just generating TensorProperties, bloating memory usage. Passing the device to the `is_pinned` call site inside `create_from_tensor` solves this issue.
This also fixes Model Store test
https://www.internalfb.com/intern/test/844425019931542?ref_report_id=0
which is currently broken on memory usage assertions.
Test Plan: UT
Differential Revision: D58695006
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128896
Approved by: https://github.com/fegin
# Summary
First PR got reverted and needed a redo
This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met:
- `x`'s scale should be a 1-dimensional tensor of length `M`.
- `y`'s scale should be a 1-dimensional tensor of length `N`.
It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".
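For reference, the semantics of row-wise scaling can be written out in plain PyTorch math (this is not the fused kernel; shapes and scale values are made up, and the scales are 1-D as described above):
```python
import torch

M, K, N = 16, 32, 8
x_fp8 = torch.randn(M, K).to(torch.float8_e4m3fn)
y_fp8 = torch.randn(K, N).to(torch.float8_e4m3fn)
scale_x = torch.rand(M)  # one scale per row of x
scale_y = torch.rand(N)  # one scale per column of y (laid out along the "row" in TN format)
# Row-scaled matmul dequantizes x per row and y per column before accumulating.
out_ref = (x_fp8.float() * scale_x[:, None]) @ (y_fp8.float() * scale_y[None, :])
```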
The following two PRs were required to enable local builds:
- [PR #126185](https://github.com/pytorch/pytorch/pull/126185)
- [PR #125523](https://github.com/pytorch/pytorch/pull/125523)
### Todo
We still do not build our Python wheels with this architecture.
@ptrblck @malfet, should we replace `sm_90` with `sm_90a`?
The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit:
https://github.com/pytorch/pytorch/pull/125204/files#r1586986954
#### ifdef
I tried to use : `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \
defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this.. so I am not really sure the right way to do this
Kernel Credit:
@jwfromm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128989
Approved by: https://github.com/yangsiyu007, https://github.com/vkuzo
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.
### SymmetricMemory
`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).
### Python API Example
```python
from torch._C.distributed_c10d import _SymmetricMemory
# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of device identified by a
# ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)
# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)
# Users can write Python custom ops that leverages the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).
# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)
# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)
if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines
    # which bypasses SMs (i.e. no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```
### Custom CUDA Comm Kernels
Given a tensor, users can access the associated `SymmetricMemory` which provides pointer to remote buffers/signal_pads needed for custom communication kernels.
```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```
### Limitations of IntraNodeComm and ProcessGroupCudaP2p
`IntraNodeComm` (used by `ProcessGroupCudaP2p`) and `ProcessGroupCudaP2p` both manage a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Cannot avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.
In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.
* __->__ #128582
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
The print statements in the get_workflow_type script are problematic because the shell script calling it expects the output to be JSON only. This PR resolves this by removing the print statements and converting them into a message field in the JSON output, so the output remains valid JSON while still giving us the debug data we are looking for.
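A minimal sketch of the pattern (the field names here are hypothetical, not the script's actual schema):
```python
import json

def emit(label_type: str, message: str) -> None:
    # Emit a single JSON object on stdout; debug text goes into "message"
    # instead of separate print() calls that would break JSON parsing.
    print(json.dumps({"label_type": label_type, "message": message}))

emit("lf.", "runner type chosen from rollout settings")
```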
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128969
Approved by: https://github.com/tylertitsworth, https://github.com/ZainRizvi
The hope that lives in this PR: I am currently trying to debug why the foreach tests are so flaky. It looks like every flaky test falls under this pattern:
- a test is flaky due to the mta_called assertion, which gathers data from the profiler regarding whether the multi_tensor_apply_kernel has been called.
- then, a later test fails deterministically, usually failing to compare two results.
```
================== 1 failed, 241 deselected, 2 rerun in 1.76s ==================
Got exit code 1
Stopping at first consistent failure
The following tests failed and then succeeded when run in a new process ['test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_add_cuda_bfloat16']
The following tests failed consistently: ['test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_bfloat16']
```
So my suspicion is that the first causes the second, but what causes the first? Idk! So it would be nice to have the error message tell us what the profiler actually saw in case it's getting muddled. This change would help mostly because I have not been able to repro this flakiness locally.
Also undo the useless changes in #128220 which are actually redundant as Joel and I realized that we set the seed during the setUp of every test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128991
Approved by: https://github.com/clee2000
This PR adds `set_post_optim_event` that allows power users to provide their own CUDA event that is recorded after the optimizer step for the FSDP root module to wait the all-gather streams on.
```
def set_post_optim_event(self, event: torch.cuda.Event) -> None:
```
By default, the root would have the all-gather streams wait on the current stream (`wait_stream`), which may introduce false dependencies if there is unrelated computation after the optimizer step and before the wait. For example, this pattern can appear in recommendation models.
To avoid those false dependencies while preserving the correctness guarantee, we provide this API so that the user can provide their own CUDA event to wait the all-gather streams on.
We include both correctness test (`test_fully_shard_training.py`) and overlap test (`test_fully_shard_overlap.py`).
---
One possible way to use the API is to register a post-step hook on the optimizer. For example:
12e8d1399b/test/distributed/_composable/fsdp/test_fully_shard_training.py (L546-L552)
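A minimal sketch of that hook-based pattern (assuming `model` is the FSDP2 root module returned by `fully_shard` and `optim` is its optimizer; names are illustrative):
```python
import torch

def setup_post_optim_event(model, optim) -> None:
    # `model` is assumed to be the FSDP2 root module (an FSDPModule);
    # `optim` is the optimizer stepping its parameters.
    post_optim_event = torch.cuda.Event()

    def post_step_hook(optimizer, args, kwargs):
        # Record right after the optimizer step and hand the event to the FSDP
        # root, so the all-gather streams wait on it instead of the current stream.
        post_optim_event.record()
        model.set_post_optim_event(post_optim_event)

    optim.register_step_post_hook(post_step_hook)
```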
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128975
Approved by: https://github.com/sanketpurandare, https://github.com/weifengpy
ghstack dependencies: #128884
Summary:
use_mtia should instead set use_device='mtia', similar to cuda, xpu, and privateuseone, to avoid an ever-growing list of use_* arguments.
Since use_mtia is specific to FBCode, we don't need a deprecation warning.
Test Plan: CI.
Differential Revision: D57338005
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126284
Approved by: https://github.com/fenypatel99
### bc-breaking for existing users of the private API:
- Existing policy functions must now change their return value to be [CheckpointPolicy](c0b40ab42e/torch/utils/checkpoint.py (L1204-L1230)) Enum instead of bool.
- To restore previous behavior, return `PREFER_RECOMPUTE` instead of `False` and `{PREFER,MUST}_SAVE` instead of `True` depending whether you prefer the compiler to override your policy.
- Policy function now accepts a `ctx` object instead of `mode` for its first argument.
- To restore previous behavior, `mode = "recompute" if ctx.is_recompute else "forward"`.
- Existing calls to `_pt2_selective_checkpoint_context_fn_gen` must be renamed to `create_selective_checkpoint_contexts `. The way you use the API remains the same. It would've been nice to do something different (not make the user have to use functools.partial?), but this was the easiest to compile (idk if this should actually be a constraint).
Related doc: https://docs.google.com/document/d/1BKyizkZPdri9mHqdDOLAUpkI7SbbKfLHRFVVpK9ZWqo/edit
Memory considerations:
- As with the existing SAC, cached values are cleared upon first use.
- We error if the user wishes to backward a second time on a region forwarded with SAC enabled.
In-place:
- We use version counting to enforce that if any cached tensor has been mutated. In-place operations not mutating cached tensors are allowed.
- `allow_cache_entry_mutation=True` can be passed to disable this check (useful in the case of auto AC where the user is cleverly also saves the output of the in-place)
Randomness, views
- Currently in this PR, we don't do anything special for randomness or views, the author of the policy function is expected to handle them properly. (Would it would be beneficial to error? - we either want to save all or recompute all random tensors)
Tensor object preservation
- ~We guarantee that if a tensor does not require grad, and it is saved, then what you get out is the same tensor object.~ UPDATE: We guarantee that if a tensor is of non-differentiable dtype AND it is not a view, and it is saved, then what you get out is the same tensor object. This is a nice guarantee for nested tensors which care about the object identity of the offsets tensor.
Policy function
- Enum values are `{MUST,PREFER}_{SAVE,RECOMPUTE}` (bikeshed welcome). Alternatively there was `{SAVE,RECOMPUTE}_{NON_,}OVERRIDABLE`. The former was preferred bc it seemed clearer that two `MUST` clashing should error, versus it is ambiguous whether two `NON_OVERRIDABLE` being stacked should silently ignore or error.
- The usage of Enum today. There actually is NO API to stack SAC policies today. The only thing the Enum should matter for in the near term is the compiler. The stacking SAC policy would be useful if someone wants to implement something like simple FSDP, but it is not perfect because with a policy of `PREFER_SAVE` you are actually saving more than autograd would save normally (would be fixed with AC v3).
- The number of times we call the policy_fn is something that should be documented as part of public API. We call the policy function for all ops except ~~detach~~ UPDATE : metadata ops listed in `torch.utils.checkpoint.SAC_IGNORED_OPS`) because these ops may be called a different number of times by AC itself between forward and recompute.
- The policy function can be a stateful object (we do NOT make separate copies of this object for forward/recompute; the user is expected to handle that via `is_recompute`, see below).
Tensors guaranteed to be the same tensor as-is
- Policy function signature takes ctx object as its first argument. The ctx function is an object encapsulating info that may be useful to the user, it currently only holds "is_recompute". Adding this indirection gives us flexibility to add more attrs later if necessary.
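A minimal end-to-end sketch of the new API shape described above (the op choice and policy are illustrative):
```python
from functools import partial

import torch
from torch.utils.checkpoint import (
    CheckpointPolicy,
    checkpoint,
    create_selective_checkpoint_contexts,
)

def policy_fn(ctx, op, *args, **kwargs):
    # Always save matmul outputs; let everything else be recomputed
    # (the compiler may override PREFER_* decisions).
    if op == torch.ops.aten.mm.default:
        return CheckpointPolicy.MUST_SAVE
    return CheckpointPolicy.PREFER_RECOMPUTE

def fn(x):
    return torch.mm(x, x).sin().sum()

x = torch.randn(8, 8, requires_grad=True)
out = checkpoint(
    fn,
    x,
    use_reentrant=False,
    context_fn=partial(create_selective_checkpoint_contexts, policy_fn),
)
out.backward()
```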
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125795
Approved by: https://github.com/Chillee, https://github.com/fmassa
Summary:
Unblocks a test that's failing.
`codegen` can be unset until `compile` is called. If `codegen` is not set, then just use the kernel name directly.
Test Plan:
```
buck2 run //caffe2/test:tensorexpr -- --regex test_simple_add
```
Differential Revision: D58727391
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128951
Approved by: https://github.com/aaronenyeshi
This PR adds two APIs `set_modules_to_forward_prefetch` and `set_modules_to_backward_prefetch` to enable explicit forward/backward all-gather prefetching, respectively.
```
def set_modules_to_forward_prefetch(self, modules: List[FSDPModule]) -> None
def set_modules_to_backward_prefetch(self, modules: List[FSDPModule]) -> None
```
**Motivation**
FSDP2 implements _reasonable defaults_ for forward and backward prefetching. In forward, it uses implicit prefetching and allows two all-gather output tensors to be alive at once (so that the current all-gather copy-out can overlap with the next all-gather). In backward, it uses explicit prefetching based on the reverse post-forward order.
However, there may be cases where, with expert knowledge, we can reduce communication bubbles by moving all-gathers manually. One way to expose such behavior is to expose _prefetching limits_, i.e. integers that configure how many outstanding all-gathers/all-gather output tensors can be alive at once. IMHO, this leans toward _easy_, not _simple_ (see [PyTorch design principles](https://pytorch.org/docs/stable/community/design.html#principle-2-simple-over-easy)).
The crux of the problem is that there may be special cases where manual intervention can give better performance. Exposing a prefetching limit and allowing users to pass a value >1 just smooths over the problem since such a limit would generally apply over the entire model even though it possibly should not. Then, expert users will see a specific all-gather that they want to deviate from this limit, and there is little we can do.
Thus, we instead choose to expose the most primitive extension point: namely, every `FSDPModule` gives an opportunity to prefetch other all-gathers in forward and in backward. How to leverage this extension point is fully up to the user. Implementing the prefetch limit can be done using this extension point (e.g. record the post-forward order yourself using forward hooks, iterate over that order, and call the `set_modules_to_forward_prefetch` / `set_modules_to_backward_prefetch` APIs).
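For illustration, a sketch of implementing a forward-prefetch window on top of this extension point (`layers` is assumed to be a list of FSDP2-wrapped blocks in forward execution order; the names are illustrative):
```python
def enable_forward_prefetch(layers, num_to_prefetch: int = 2) -> None:
    # Each element of `layers` is assumed to be an FSDPModule.
    for i, layer in enumerate(layers):
        # From layer i, explicitly kick off the all-gathers of the next few layers.
        layer.set_modules_to_forward_prefetch(layers[i + 1 : i + 1 + num_to_prefetch])
```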
Differential Revision: [D58700346](https://our.internmc.facebook.com/intern/diff/D58700346)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128884
Approved by: https://github.com/ckluk2, https://github.com/weifengpy
Adds support for `Variable._execution_engine.queue_callback()`, which is used in FSDP2.
Important tests:
- `pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_callback_graph_break_throws_error`
- `pytest -rA test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_callback_adds_callback`
- `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_autograd.py -k TestAutograd.test_callback_adds_callback`
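For reference, a small eager-mode sketch of queuing a callback on the execution engine (the same pattern is what compiled autograd now supports):
```python
import torch
from torch.autograd import Variable

def on_backward_done():
    print("backward finished")

x = torch.randn(4, requires_grad=True)

def hook(grad):
    # Queue a callback that the engine runs once the whole backward completes.
    Variable._execution_engine.queue_callback(on_backward_done)
    return grad

x.register_hook(hook)
x.sin().sum().backward()
```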
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126366
Approved by: https://github.com/xmfan
Summary
Pass parameters from request to dump_nccl_trace_pickle handler.
The supported parameters + value are all lowercase.
includecollectives={true, false}
includestacktraces={true, false}
onlyactive={true, false}
Example post is:
/handler/dump_nccl_trace_pickle?includecollectives=true&includestacktraces=false&onlyactive=true
Test Plan:
unit tests
Differential Revision: [D58640474](https://our.internmc.facebook.com/intern/diff/D58640474)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128781
Approved by: https://github.com/d4l3k
This adds a `dump_traceback` handler so you can see all running threads for a job. This uses a temporary file as a buffer when calling `faulthandler.dump_traceback` and requires the GIL to be held during dumping.
Test plan:
```
python test/distributed/elastic/test_control_plane.py -v -k traceback
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128904
Approved by: https://github.com/c-p-i-o
In NVIDIA internal CI, on Jetson devices we are seeing this failure for `python test/inductor/test_cuda_cpp_wrapper.py -k test_addmm_cuda_cuda_wrapper -k test_linear_relu_cuda_cuda_wrapper`:
```
/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:132: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
warnings.warn(
W0613 20:57:17.722000 281473279256672 torch/_inductor/utils.py:902] [0/0] Not enough SMs to use max_autotune_gemm mode
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 2), ('unique_graphs', 1)]
inductor [('extern_calls', 2), ('fxgraph_cache_miss', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1)]
aot_autograd [('total', 1), ('ok', 1)]
F
======================================================================
FAIL: test_linear_relu_cuda_cuda_wrapper (__main__.TestCudaWrapper)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper
method(*args, **kwargs)
File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 9818, in new_test
return value(self)
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/pytorch/pytorch/test/inductor/test_cuda_cpp_wrapper.py", line 152, in fn
_, code = test_torchinductor.run_and_get_cpp_code(
File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 356, in run_and_get_cpp_code
result = fn(*args, **kwargs)
File "/opt/pytorch/pytorch/test/inductor/test_select_algorithm.py", line 43, in wrapped
return fn(*args, **kwargs)
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/usr/lib/python3.10/unittest/mock.py", line 1379, in patched
return func(*newargs, **newkeywargs)
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/pytorch/pytorch/test/inductor/test_select_algorithm.py", line 62, in test_linear_relu_cuda
self.assertEqual(counters["inductor"]["select_algorithm_autotune"], 1)
File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 3642, in assertEqual
raise error_metas.pop()[0].to_error(
AssertionError: Scalars are not equal!
Expected 1 but got 0.
Absolute difference: 1
Relative difference: 1.0
```
Looking into it, we see the failure is from https://github.com/pytorch/pytorch/blob/main/test/inductor/test_select_algorithm.py#L62. The warning `W0613 20:57:17.722000 281473279256672 torch/_inductor/utils.py:902] [0/0] Not enough SMs to use max_autotune_gemm` is triggered from https://github.com/pytorch/pytorch/blob/main/torch/_inductor/utils.py#L973. `torch.cuda.get_device_properties(0).multi_processor_count` returns 16 on the computelab AGX Orin, so the check fails: min_required_sms is 68, which prevents the autotune algorithm from being picked. Looking at the main block of test_select_algorithm.py, we see that these tests should only be run if is_big_gpu(0) is true: https://github.com/pytorch/pytorch/blob/main/test/inductor/test_select_algorithm.py#L344. This PR therefore adds a similar check to the invocation of these tests in test_cuda_cpp_wrapper.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128652
Approved by: https://github.com/soulitzer, https://github.com/eqy
Fixes #105157
Bug source: `from __future__ import annotations` converts type annotations to strings to make forward references easier. However, existing custom ops do not consider strings to be valid types.
Fix: We check whether the argument and return type annotations are strings. If so, we try to use `eval` to convert them back into types.
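A minimal sketch of the pattern that previously broke (the op name and library namespace are made up):
```python
from __future__ import annotations  # all annotations below become strings

import torch

@torch.library.custom_op("mylib::add_one", mutates_args=())
def add_one(x: torch.Tensor) -> torch.Tensor:
    # With postponed evaluation, the annotation is the string "torch.Tensor";
    # the fix eval()s such strings back into real types when inferring the schema.
    return x + 1

print(add_one(torch.ones(3)))
```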
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128809
Approved by: https://github.com/zou3519
Related to: https://github.com/pytorch/pytorch/issues/125879
This checks whether torch is compiled with CUDA before publishing the CUDA Docker nightly image.
Test
```
#18 [conda-installs 5/5] RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); echo "Is torch compiled with cuda: ${IS_CUDA}"; if test "${IS_CUDA}" != "True" -a ! -z "12.4.0"; then exit 1; fi
#18 1.656 Is torch compiled with cuda: False
#18 ERROR: process "/bin/sh -c IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); echo \"Is torch compiled with cuda: ${IS_CUDA}\"; if test \"${IS_CUDA}\" != \"True\" -a ! -z \"${CUDA_VERSION}\"; then \texit 1; fi" did not complete successfully: exit code: 1
------
> [conda-installs 5/5] RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); echo "Is torch compiled with cuda: ${IS_CUDA}"; if test "${IS_CUDA}" != "True" -a ! -z "12.4.0"; then exit 1; fi:
1.656 Is torch compiled with cuda: False
------
Dockerfile:80
--------------------
79 | RUN /opt/conda/bin/pip install torchelastic
80 | >>> RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())');\
81 | >>> echo "Is torch compiled with cuda: ${IS_CUDA}"; \
82 | >>> if test "${IS_CUDA}" != "True" -a ! -z "${CUDA_VERSION}"; then \
83 | >>> exit 1; \
84 | >>> fi
85 |
--------------------
ERROR: failed to solve: process "/bin/sh -c IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); echo \"Is torch compiled with cuda: ${IS_CUDA}\"; if test \"${IS_CUDA}\" != \"True\" -a ! -z \"${CUDA_VERSION}\"; then \texit 1; fi" did not complete successfully: exit code: 1
(base) [ec2-user@ip-172-30-2-248 pytorch]$ docker buildx build --progress=plain --platform="linux/amd64" --target official -t ghcr.io/pytorch/pytorch:2.5.0.dev20240617-cuda12.4-cudnn9-devel --build-arg BASE_IMAGE=nvidia/cuda:12.4.0-devel-ubuntu22.04 --build-arg PYTHON_VERSION=3.11 --build-arg CUDA_VERSION= --build-arg CUDA_CHANNEL=nvidia --build-arg PYTORCH_VERSION=2.5.0.dev20240617 --build-arg INSTALL_CHANNEL=pytorch --build-arg TRITON_VERSION= --build-arg CMAKE_VARS="" .
#0 building with "default" instance using docker driver
```
Please note: it looks like we are installing from the pytorch channel rather than the nightly channel on this PR, hence CUDA 12.4 is failing since it's not in the pytorch channel yet:
https://github.com/pytorch/pytorch/actions/runs/9555354734/job/26338476741?pr=128852
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128852
Approved by: https://github.com/malfet
Summary:
For PointToPoint(sendrecv), the deviceId is lower_rank:higher_rank. This means a p2p group cannot be created through commSplit since it cannot find a parent.
Fix this by using the right device key of current rank.
Differential Revision: D58631639
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128803
Approved by: https://github.com/shuqiangzhang
Fixes #127908
## Description
Created docs to document the torch.cuda.cudart function to solve the issue #127908.
I tried to stick to the [guidelines to document a function](https://github.com/pytorch/pytorch/wiki/Docstring-Guidelines#documenting-a-function) but I was not sure if there is a consensus on how to handle the docs of a function that calls an internal function. So I went ahead and tried what the function will raise, etc. from the user endpoint and documented it (i.e. I am giving what actually _lazy_init() will raise).
Updated PR from #128298 since I made quite a big mistake in my branch. I apologize for the newbie mistake.
### Summary of Changes
- Added docs for torch.cuda.cudart
- Added the cudart function in the autosummary of docs/source/cuda.rst
## Checklist
- [X] The issue that is being fixed is referred in the description
- [X] Only one issue is addressed in this pull request
- [X] Labels from the issue that this PR is fixing are added to this pull request
- [X] No unnecessary issues are included in this pull request
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128741
Approved by: https://github.com/msaroufim
# Summary
The primary reason for the change was the lack of a current use case and the need to work around two Inductor issues:
- Tensor arguments as kwarg only
- multiple outputs from triton templates
If the need for the amax return type arises, we can consider adding it back or, more likely, creating a separate op.
In principle PyTorch is moving away from ops that bundle lots of functionality into "mega ops". We instead rely upon the compiler to generate appropriate fused kernels.
### Changes:
- This removes the amax return type from scaled_mm. We have found that the common use case is to return in "high-precision" ( a type with more precision than fp8). This is only relevant when returning in low-precision.
- We currently still allow for fp8 returns and scaled result. Perhaps we should also ban this as well...
New signature:
```Python
def meta_scaled_mm(
    self: torch.Tensor,
    mat2: torch.Tensor,
    scale_a: torch.Tensor,
    scale_b: torch.Tensor,
    bias: Optional[torch.Tensor] = None,
    scale_result: Optional[torch.Tensor] = None,
    out_dtype: Optional[torch.dtype] = None,
    use_fast_accum: bool = False,
) -> torch.Tensor:
```
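A hedged usage sketch against this signature (requires an fp8-capable GPU; shapes and scale values are illustrative):
```python
import torch

x = torch.randn(32, 64, device="cuda").to(torch.float8_e4m3fn)
w = torch.randn(16, 64, device="cuda").to(torch.float8_e4m3fn).t()  # TN layout
scale_x = torch.tensor(1.0, device="cuda")
scale_w = torch.tensor(1.0, device="cuda")
out = torch._scaled_mm(x, w, scale_a=scale_x, scale_b=scale_w, out_dtype=torch.bfloat16)
# `out` is a single Tensor; the separate amax return value is gone.
```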
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128683
Approved by: https://github.com/vkuzo
#### Issue
Tensor constants were previously lifted directly as inputs in the fx graph, which results in errors for multiple test cases with tensor constants. This PR introduces a fix to convert tensor constants to `GetAttr` nodes in the fx graph.
This PR also introduces other fixes to maintain a valid `state_dict` for exported program when there are tensor constants. In short, after tensor constants are converted as `GetAttr`, they are treated as buffers during retracing. The fix will convert those back from buffer to constant.
#### Test Plan
Add new test cases that generate tensor constants
* `pytest test/export/test_converter.py -s -k test_implicit_constant_to_tensor_handling`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128442
Approved by: https://github.com/angelayi
Summary: Today meta['val'] on placeholder nodes doesn't preserve the consistent requires_grad information with the original inputs. Seems there's no easy way to fix this directly at proxy tensor layer. This is useful for reexporting joint graph.
Test Plan: test_preserve_requires_grad_placeholders
Differential Revision: D58555651
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128656
Approved by: https://github.com/tugsbayasgalan
Idea: close over min / max sequence length in the main NJT view func (`_nested_view_from_jagged`) so that view replay during fake-ification propagates these correctly in torch.compile.
For dynamic shapes support for min / max sequence length, this PR uses a hack that stores the values in `(val, 0)` shaped tensors.
**NB: This PR changes SDPA to operate on real views instead of using `buffer_from_jagged()` / `ViewNestedFromBuffer`, which may impact the internal FIRST model. That is, it undoes the partial revert from #123215 alongside a fix to the problem that required the partial revert. We need to verify that there are no regressions there before landing.**
Differential Revision: [D55448636](https://our.internmc.facebook.com/intern/diff/D55448636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122836
Approved by: https://github.com/soulitzer
ghstack dependencies: #127007, #128057
Summary: We lose traceback info when an exception occurs in a subprocess because Python traceback objects don't pickle. In the subprocess-based parallel compile, we _are_ logging an exception in the subprocess, but a) those messages are easy to miss because they're not in the traceback output, and b) it seems that logging in the subproc is swallowed by default in internal builds. This PR captures the traceback in the subprocess and makes it available in the exception thrown in the main process. Users now see failures that look like this:
```
...
File "/home/slarsen/.conda/envs/pytorch-3.10_3/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/home/slarsen/.conda/envs/pytorch-3.10_3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
SubprocException: An exception occurred in a subprocess:
Traceback (most recent call last):
File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 270, in do_job
result = SubprocMain.foo()
File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 263, in foo
SubprocMain.bar()
File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 260, in bar
SubprocMain.baz()
File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 257, in baz
raise Exception("an error occurred")
Exception: an error occurred
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128775
Approved by: https://github.com/jansel
Test CI
This fixes issues like the one below, where numpy gets imported even though I don't intend to use the fuzzer at all. With this change, numpy is imported only when someone actually calls functions from the fuzzer; the import no longer happens at the top of the file.
```
>>> import torchao
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/__init__.py", line 26, in <module>
from torchao.quantization import (
File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/quantization/__init__.py", line 7, in <module>
from .smoothquant import * # noqa: F403
File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/quantization/smoothquant.py", line 18, in <module>
import torchao.quantization.quant_api as quant_api
File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/quantization/quant_api.py", line 23, in <module>
from torchao.utils import (
File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/utils.py", line 2, in <module>
import torch.utils.benchmark as benchmark
File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torch/utils/benchmark/__init__.py", line 4, in <module>
from torch.utils.benchmark.utils.fuzzer import * # noqa: F403
File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torch/utils/benchmark/utils/fuzzer.py", line 5, in <module>
import numpy as np
ModuleNotFoundError: No module named 'numpy'
```
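The fix follows the usual lazy-import pattern, roughly (the function name is illustrative):
```python
def _generate_fuzzed_values(n):
    # Import numpy only when fuzzer functionality is actually used,
    # so importing torch.utils.benchmark no longer requires numpy.
    import numpy as np
    return np.random.rand(n)
```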
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128759
Approved by: https://github.com/Skylion007
Improve Dynamo to support the FSDP2 `use_training_state()` context manager.
Test command:
`
pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_dynamo_trace_use_training_state
`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127854
Approved by: https://github.com/yanboliang
**Summary**
Inductor currently uses modulo and division to compute indices into certain multi-dimensional tensors, such as those arising from row padding. This PR matches on that indexing pattern, replacing it with an N-D block pointer. This should be more efficient than computing indices with division and modulo, and it can easily map to DMAs on non-GPU hardware targets.
Because the 1D block size needs to map to an integer block shape in ND, we need to know that the ND block size evenly divides the size of the iteration range. This PR only generates ND block pointers when it can guarantee that the iteration order and number of elements loaded are unchanged. This means that the number of elements in a slice of the iteration range must either be:
- Powers of 2. Since Triton block sizes are powers of 2, any integer power of 2 either divides the block size, or is greater than the block size. In the latter case, `CeilDiv(x, y)` rounds up to 1.
- Multiples of the maximum block size. Since block sizes are powers of 2, the maximum block size is a multiple of every possible block size.
Note that a *slice* of the iteration range does not include the leading dimension. Thus we can support arbitrary leading dimensions like `(5,8)`.
Feature proposal and discussion: https://github.com/pytorch/pytorch/issues/125077
Example kernel:
```
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4096
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    tmp0 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr0, shape=[32, 16, 8], strides=[1024, 32, 1], block_shape=[32 * (32 <= ((127 + XBLOCK) // 128)) + ((127 + XBLOCK) // 128) * (((127 + XBLOCK) // 128) < 32), 16 * (16 <= ((7 + XBLOCK) // 8)) + ((7 + XBLOCK) // 8) * (((7 + XBLOCK) // 8) < 16), 8 * (8 <= XBLOCK) + XBLOCK * (XBLOCK < 8)], order=[0, 1, 2], offsets=[(xoffset // 128), (xoffset // 8) % 16, xoffset % 8]), boundary_check=[0, 1, 2]), [XBLOCK])
    tmp1 = tmp0 + tmp0
    tl.store(tl.make_block_ptr(out_ptr0, shape=[4096], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp1, [XBLOCK]).to(tl.float32))
''', device_str='cuda')
```
**Test Plan**
This PR adds a new CI test script to cover this feature. The tests can be grouped into a few main categories:
- Can we generate strided block pointers for the appropriate shapes?
- Powers of 2
- Non-power of 2, but multiple of the maximum block size
- Arbitrary leading dimensions, with power of 2 inner dimensions
- Weird strides and offsets
- Reductions
- Symbolic shapes that are multiples of the maximum block size (wasn't able to trace this through dynamo)
- Broadcasts (some variables are missing from the indexing expression)
- Do we still compile other cases correctly, even if we don't expect to be able to generate block pointers?
- Unsupported static shapes
- Unsupported symbolic shapes
- Mixing and matching these cases:
- Pointwise and reduction in the same kernel
- Sanity check the test harness
- Do we raise an exception if the expected number of block pointers and the actual number are different?
**Follow-ups**
There are a few important cases which this PR can't handle. I'm hoping these can be deferred to follow-up PRs:
- Handle non-divisible shapes
- Change the tiling algorithm to generate a 2D (X,Y) blocking, if doing so enables block pointers to be emitted.
- Pad unsupported loads up to the nearest divisible size, then mask/slice out the extra elements? This is probably the best solution, but I'm not yet sure how to go about it in triton.
- Take advantage of this analysis when `triton.use_block_ptr=False`. I'm guessing we can still avoid `%` and `/` without requiring block pointers. Maybe we could compute block indices with arange and broadcast instead?
Differential Revision: D56739375
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127342
Approved by: https://github.com/jansel, https://github.com/shunting314
The following are all constrained under the ONNX exporter project scope.
- `persons_of_interest.rst`
- Moving folks no longer working on the project to emeritus.
- Adding @justinchuby, @titaiwangms, @shubhambhokare1 and @xadupre,
who have all made countless contributions to this project.
- `CODEOWNERS`
- Removing folks no longer working on the project.
- Updating new owners who will now be notified with PRs related to
the specific file paths.
- `merge_rules.yaml`
- Removing folks no longer working on the project.
🫡
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126364
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby, https://github.com/albanD
Summary: If any subprocess in the pool crashes, we get a BrokenProcessPool exception and the whole pool becomes unusable. Handle crashes by recreating the pool.
Test Plan:
* New unit test
* Started a long-running test (`test/inductor/test_torchinductor.py`), periodically killed subprocess manually, made sure the test run recovers and makes progress.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128757
Approved by: https://github.com/jansel
When handling an input to dynamo that's a view of a subclass, dynamo does some handling to reconstruct the view. Part of this is to construct symints for the input parameters to the view.
Previously, the code would just call `create_symbol()` which by default specifies a _positive_ symint (>= 0); this fails in the case where you have an aten::view that was called with a -1.
Fix: just specify `positive=None` when calling `create_symbol()`, to avoid restricting the symint to >= 0 or <= 0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128662
Approved by: https://github.com/jbschlosser
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.
### SymmetricMemory
`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).
### Python API Example
```python
from torch._C.distributed_c10d import _SymmetricMemory
# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of device identified by a
# ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)
# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)
# Users can write Python custom ops that leverages the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).
# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)
# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)
if symm_mem.rank == 0:
symm_mem.wait_signal(src_rank=1)
assert buf.eq(42).all()
else:
# The remote buffer can be used as a regular tensor
buf.fill_(42)
symm_mem.put_signal(dst_rank=0)
symm_mem.barrier()
if symm_mem.rank == 0:
symm_mem.barrier()
assert buf.eq(43).all()
else:
new_val = torch.empty_like(buf)
new_val.fill_(43)
# Contiguous copies to/from a remote buffer utilize copy engines
# which bypasses SMs (i.e. no need to load the data into registers)
buf.copy_(new_val)
symm_mem.barrier()
```
### Custom CUDA Comm Kernels
Given a tensor, users can access the associated `SymmetricMemory` which provides pointer to remote buffers/signal_pads needed for custom communication kernels.
```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```
### Limitations of IntraNodeComm and ProcessGroupCudaP2p
Both `IntraNodeComm` and `ProcessGroupCudaP2p` (which uses it) manage a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Cannot avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.
In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
In this PR, we abstracted the different types of aten operation parameters as `ParameterMetadata`. This structure is intended to represent and store the metadata of each aten operation parameter. Currently, it only supports `Tensor`, `TensorList`, and `Scalar`.
```C++
using ParameterMetadataValue = std::variant<TensorMetadata, std::vector<TensorMetadata>, c10::Scalar>;
```
With this PR, we can extend support for other parameter types in a more modular way, like `string`, `int`, `double`, and the other types summarized in the following list. The list is collected from all aten operations and ordered by how frequently each type is used.
- `Tensor`
- `bool`
- `int64_t`
- `TensorList`
- `Scalar`
- `c10::SymIntArrayRef`
- `::std::optional<Tensor>`
- `IntArrayRef`
- `double`
- `c10::SymInt`
- `::std::optional<ScalarType>`
- `::std::optional<double>`
- `::std::optional<bool>`
- `::std::optional<Layout>`
- `::std::optional<Device>`
- `::std::optional<int64_t>`
- `Dimname`
- `::std::optional<Generator>`
- `c10::string_view`
- `::std::optional<c10::string_view>`
- `OptionalIntArrayRef`
- `::std::optional<Scalar>`
- `OptionalSymIntArrayRef`
- `::std::optional<MemoryFormat>`
- `::std::optional<c10::SymInt>`
- `ScalarType`
- `ArrayRef<Scalar>`
- `DimnameList`
- `::std::optional<ArrayRef<double>>`
- `::std::array<bool,3>`
- `::std::optional<DimnameList>`
- `c10::List<::std::optional<Tensor>>`
- `::std::array<bool,2>`
- `Storage`
- `::std::array<bool,4>`
- `Device`
- `DeviceIndex`
- `ITensorListRef`
- `Stream`
- `Layout`
- `MemoryFormat`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125308
Approved by: https://github.com/jgong5, https://github.com/jansel
This adjusts the settings of the libuv backend to match the older TCPStore.
* DEFAULT_BACKLOG: setting this to -1 will enable using the host somaxconn value instead of a hardcoded 16k value. When going over this limit with `tcp_abort_on_overflow` set, it results in connections being reset.
* TCP_NODELAY: Since TCPStore primarily sends small messages, there's no benefit to using Nagle's algorithm and it may add additional latency for store operations.
Test plan:
```
python test/distributed/test_store.py -v -k LibUv
```
Benchmark script:
```
import time
import os
import torch.distributed as dist
rank = int(os.environ["RANK"])
store = dist.TCPStore(
    host_name="<server>",
    port=29500,
    world_size=2,
    is_master=(rank == 0),
    use_libuv=True,
)

if rank == 1:
    total_iters = 0
    total_dur = 0
    for iter in range(10):
        iters = 500000
        start = time.perf_counter()
        for i in range(iters):
            store.set(f"key_{i}", f"value_{i}")
        dur = time.perf_counter() - start
        print(f"{iter}. {iters} set, qps = {iters/dur}")
        total_iters += iters
        total_dur += dur
    print(f"overall qps = {total_iters/total_dur}")
else:
    print("sleeping")
    time.sleep(1000000000)
```
The performance difference with and without TCP_NODELAY is negligible on a single host.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128739
Approved by: https://github.com/rsdcastro, https://github.com/kurman, https://github.com/c-p-i-o
Fix docstrings in Learning Rate Scheduler.
The fix can be verified by running pydocstyle path-to-file --count
Related #112593
**BEFORE the PR:**
pydocstyle torch/optim/lr_scheduler.py --count
92
**AFTER the PR:**
pydocstyle torch/optim/lr_scheduler.py --count
0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128679
Approved by: https://github.com/janeyx99
Changes:
1. Add memory align macro support on Windows.
2. Fix `#pragma unroll` not being supported by the MSVC cl compiler.
`#pragma unroll` raises an error on the MSVC `cl` compiler, but it is supported by `clang` on Windows.
We therefore disable it only for the `__msvc_cl__` compiler, so builds using `clang` still get the better performance it enables.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128686
Approved by: https://github.com/jgong5, https://github.com/jansel
Fixes https://github.com/pytorch/pytorch/issues/125720
I was earlier worried that DELETE_* or STORE_* on referent values should result in a graph break, because they could invalidate the weak ref. But then @zou3519 pointed out that weakref invalidation will happen EVENTUALLY; CPython provides no guarantees about when the weakref will be invalidated (even when the user calls del x and x is the last reference).
So any code that relies on del x to invalidate the weakref of x right away is BAD code; CPython provides no guarantees. Therefore we can (ab)use this nuance and just ignore DELETE_* or STORE_* on the referent objects.
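A small plain-CPython illustration of this nuance (nothing Dynamo-specific):
```python
# A weakref to x is not guaranteed to be cleared the moment `del x` runs; here a
# second strong reference keeps the object alive, and even without one the
# language gives no timing guarantee.
import weakref

class Foo:
    pass

x = Foo()
r = weakref.ref(x)
y = x        # a second strong reference
del x        # deleting the name does not invalidate r
print(r() is None)   # False: the referent is still alive via y
```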
The only corner case is when Dynamo is reconstructing the weakref object. Dynamo will have a hard time being correct here, so just SKIP_FRAME on such a case. This is rare.
CPython notes
1) https://docs.python.org/3/library/weakref.html
2) https://docs.python.org/3/reference/datamodel.html#index-2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128533
Approved by: https://github.com/jansel
Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](17b45e905a/torch/fx/graph_module.py (L824))), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module.
Example print from `python test/export/test_unflatten.py -k test_unflatten_nested`
```
class UnflattenedModule(torch.nn.Module):
    def forward(self, x: "f32[2, 3]"):
        # No stacktrace found for following nodes
        rootparam: "f32[2, 3]" = self.rootparam
        # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam
        mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam); x = rootparam = None
        # No stacktrace found for following nodes
        foo: "f32[2, 3]" = self.foo(mul); mul = None
        bar: "f32[2, 3]" = self.bar(foo); foo = None
        return (bar,)

    class foo(torch.nn.Module):
        def forward(self, mul: "f32[2, 3]"):
            # No stacktrace found for following nodes
            child1param: "f32[2, 3]" = self.child1param
            nested: "f32[2, 3]" = self.nested(mul); mul = None
            # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param
            add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param); nested = child1param = None
            return add

        class nested(torch.nn.Module):
            def forward(self, mul: "f32[2, 3]"):
                # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x
                div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul); mul = None
                return div

    class bar(torch.nn.Module):
        def forward(self, add: "f32[2, 3]"):
            # No stacktrace found for following nodes
            child2buffer: "f32[2, 3]" = self.child2buffer
            # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer
            sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer); add = child2buffer = None
            return sub
```
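For reference, a minimal usage sketch (the module here is made up; it assumes the `torch.export.export` and `torch.export.unflatten` entry points):
```python
# Export a module, unflatten it, and print the nested-module view shown above.
import torch

class MyModel(torch.nn.Module):
    def forward(self, x):
        return x * 2

ep = torch.export.export(MyModel(), (torch.randn(2, 3),))
unflattened = torch.export.unflatten(ep)
unflattened.print_readable()   # the method added in this PR
```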
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128617
Approved by: https://github.com/zhxchen17, https://github.com/pianpwk
Summary:
Firstly, this does not change any existing behaviour, since all the
default values for kwargs were hardcoded into the ``_checkpoint_without_reentrant_generator`` call.
Secondly, this is needed for unlocking the full potential of composable
checkpointing making it equivalent to ``torch.utils.checkpoint.checkpoint(use_reentrant=False)``.
Finally, an added benefit is now composable checkpointing can be used under ``FakeTensorMode`` by
passing ``preserve_rng_state=False``.
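A minimal sketch of what this enables, assuming the composable `checkpoint` entry point under `torch.distributed._composable` (the kwarg forwarding is the point of this PR):
```python
# Sketch only: the PR makes kwargs like preserve_rng_state flow through to the
# non-reentrant checkpoint implementation instead of being hardcoded.
import torch
from torch.distributed._composable import checkpoint

mod = torch.nn.Linear(8, 8)
checkpoint(mod, preserve_rng_state=False)   # now configurable
out = mod(torch.randn(4, 8, requires_grad=True))
out.sum().backward()
```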
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128516
Approved by: https://github.com/awgu
This PR is enough to get this test to pass when using `TORCHDYNAMO_INLINE_INBUILT_NN_MODULES`:
```
TORCHDYNAMO_INLINE_INBUILT_NN_MODULES=1 python test/inductor/test_group_batch_fusion.py -k TestPostGradBatchLinearFusion.test_batch_linear_post_grad_fusion
```
inductor has a pre-grad pass to swap out multiple `linear` layers with `addbmm`, but it also needs to insert an `unbind()` at the end. If that unbind is then followed by a mutation (like `add_()`), the autograd engine will complain (autograd does not let you mutate the output of multiple-out-view ops like unbind).
I made a tweak to the pattern matching logic to avoid matching if the output of the linear is used in an op that mutates its input. My hope is that:
(1) this situation is rare enough that it won't materially impact pattern matching in real world code
(2) I had to use a heuristic for "is an op a mutable op", since the graph we get is from dynamo, so it can contain code like `operator.iadd` in it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128570
Approved by: https://github.com/eellison, https://github.com/mlazos
ghstack dependencies: #127927
Fixes https://github.com/pytorch/pytorch/issues/127374
The error in the linked repro is:
```
AssertionError: Please convert all Tensors to FakeTensors first or instantiate FakeTensorMode with 'allow_non_fake_inputs'. Found in aten.sym_storage_offset.default(_to_functional_tensor(FakeTensor(..., device='cuda:0', size=(16, 4), dtype=torch.uint8),
device='cuda:0'))
```
Where we hit FakeTensor.__torch_dispatch__, but our input is a C++ `FunctionalTensorWrapper`.
What should actually have happened is that the call to `aten.sym_storage_offset` hits the `Functionalize` dispatch key, which should remove the `FunctionalTensorWrapper` and redispatch. I spent some time debugging and haven't actually figured out why this isn't happening. Instead, this PR just skips that step completely, and asks `FunctionalTensor` to directly unwrap the C++ `FunctionalTensorWrapper` when querying tensor metadata.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127927
Approved by: https://github.com/tugsbayasgalan
This adds better logging of errors to the socket and TCPStore classes.
All socket operations should now include the local and remote addresses and we actually log errors from the TCPStoreBackend::run as well as TCPStoreBackendUV which were previously INFO messages and not actually logged.
It also overhauls test_wait in test_store.py as it had a race condition causing it to be flaky.
Test plan:
```
python test/distributed/test_store.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128673
Approved by: https://github.com/c-p-i-o
We introduced AOTI_TORCH_CHECK in #119220 to resolve slow-compilation
time issues. Unfortunately, it caused perf regressions for CPU,
as described in issue #126665. After some investigation, it turned
out the slow compilation was caused by the use of the builtin
function __builtin_expect provided by gcc/clang. Moreover,
nuking __builtin_expect doesn't seem to cause any performance penalty,
even though its purpose is to improve performance by providing the
compiler with branch prediction information.
Abs latency numbers using the script shared by #126665:

| Model | Before the fix | After the fix |
| --- | --- | --- |
| T5Small | 1019.055694 | 917.875027 |
| T5ForConditionalGeneration | 1009.825196 | 916.369239 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128402
Approved by: https://github.com/desertfire
This PR enables specific axes to be dynamic by calling torch.export.export and torch.export.Dim (see the sketch after the feature list).
Features:
(1) Turn dynamic_axes to dynamic_shapes
(2) Dim constraints remain the same (see the test case that hits constraints). This might give a different user experience, since we didn't have any constraints in torchscript-onnx exporting.
(3) If input_names is used in dynamic_axes, ValueError will be raised, as input_names is currently not supported.
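A minimal sketch of the torch.export.Dim / dynamic_shapes machinery that dynamic_axes is mapped onto (the model here is made up):
```python
# Sketch of the public dynamic_shapes spec that the exporter now builds from the
# legacy dynamic_axes argument.
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

batch = torch.export.Dim("batch")
ep = torch.export.export(M(), (torch.randn(2, 3),),
                         dynamic_shapes={"x": {0: batch}})
print(ep)
```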
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128371
Approved by: https://github.com/justinchuby
Fixes #113124.
## Description
I modified the installing.rst file to address the system requirements and troubleshooting steps for using LibTorch with different GLIBC versions.
### Summary of Changes
- Added system requirements specifying the GLIBC version needed for both the cxx11 ABI version and the pre-cxx11 ABI version of LibTorch.
- Included a troubleshooting section with instructions on how to check the dependencies of the LibTorch libraries and identify the required GLIBC version using the `ldd lib/libtorch.so` command.
## Checklist
- [X] The issue that is being fixed is referred in the description
- [X] Only one issue is addressed in this pull request
- [X] Labels from the issue that this PR is fixing are added to this pull request
- [X] No unnecessary issues are included into this pull request
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128135
Approved by: https://github.com/jbschlosser
We don't care about the Dynamo x TorchScript composition, so I'm
disabling these tests (so they don't get reported as flaky). Not
disabling all of the TorchScript tests yet because they have been useful
to catch random bugs.
Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128731
Approved by: https://github.com/williamwen42
FIXES #113263. Same idea as in https://github.com/pytorch/pytorch/pull/113417, but we need a more intrusive C API to silently no-op default saved tensor hooks, in order to support user code that uses torch.autograd.disable_saved_tensors_hooks (see test_unpack_hooks_can_be_disabled). We mock the output of get_hooks while leaving push/pop untouched.
For compiled autograd, we're firing pack hooks once and unpack hooks twice right now, I'll look into this separately from this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123196
Approved by: https://github.com/soulitzer
These models are really flaky. I went into the CI machine and ran the model many times; sometimes it fails, sometimes it passes. Even PyTorch-eager results change from run to run, so the accuracy comparison is fundamentally broken/non-deterministic. I am hitting these issues more frequently in the inlining work. There is nothing wrong with inlining; I think these models are on the edge of already-broken accuracy measurement, and inlining just pushes them further in the broken direction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128715
Approved by: https://github.com/eellison
This is an extension of [ModuleTracker](https://github.com/pytorch/pytorch/blob/main/torch/utils/module_tracker.py) with added features and bug fixes.
1. Allows installing user-defined hooks to be called in pre-fw, post-fw, pre-bw and post-bw hooks of the ``ModTracker``.
2. Adds a function ``get_known_fqn`` that retrieves the fqn of the module as tracked by the ``ModTracker``.
3. Only registers the multi-grad hooks if we are in the forward pass. This is important because, a module's pre-fw and post-fw hooks get called in the backward during AC and we do not want to register multi-grad hooks in this case.
4. Sets the kwarg ``always_call=True`` for post-fw hooks, so that they are called post AC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128508
Approved by: https://github.com/wanchaol
Adds `C10_UBSAN_ENABLED` macro and use it to disable `SymIntTest::Overflows` (fails under `signed-integer-overflow` UBSAN check).
Also cleans up UBSAN guard in `jit/test_misc.cpp` to use `C10_UBSAN_ENABLED` and the existing `C10_ASAN_ENABLED` instead of locally defining `HAS_ASANUBSAN`.
> NOTE: This should fix `SymIntTest::Overflows` failing under ubsan in fbcode too...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127967
Approved by: https://github.com/atalman, https://github.com/d4l3k, https://github.com/malfet
This PR renames the implementation details of register_fake to align
more with the new name. It is in its own PR because this is risky
(torch.package sometimes depends on private library functions and
implementation details).
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123938
Approved by: https://github.com/williamwen42
Tries to fix #127677.
# Context
Just as @peterbell10 pointed out, we have the following scenario:
```
a = ops.indirect_indexing(...)
b = ops.index_expr(a, ...)
c = ops.indirect_indexing(b, ...)
```
We can repro this as:
```
def forward(self, arg0_1, arg1_1, arg2_1):
    iota = torch.ops.prims.iota.default(arg0_1, start = 0, step = 1, index=0),
    repeat_interleave = torch.ops.aten.repeat_interleave.Tensor(arg1_1);
    index = torch.ops.aten.index.Tensor(iota, [repeat_interleave]);
    index_1 = torch.ops.aten.index.Tensor(arg2_1, [index]);
    return (index_1,)
```
which should generate a JIT py file like this:
```
def triton_poi_fused_index_select_0(in_ptr0, in_ptr1, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr):
    ...
    tmp0 = tl.load(in_ptr0 + (x1), xmask, eviction_policy='evict_last')
    tmp1 = ks0
    tmp2 = tmp0 + tmp1
    tmp3 = tmp0 < 0
    tmp4 = tl.where(tmp3, tmp2, tmp0)
    # check_bounds()
    tl.device_assert(((0 <= tmp4) & (tmp4 < ks0)) | ~(xmask), "index out of bounds: 0 <= tmp4 < ks0")

def call():
    arg0_1, arg1_1, arg2_1 = args
    buf1 = aten.repeat_interleave.Tensor(arg1_1)
    buf4 = empty_strided_cuda((u0, 64), (64, 1))
    triton_poi_fused_index_select_0.run(
        buf1, arg2_1, buf4, s0,
        triton_poi_fused_index_select_0_xnumel,
        grid=grid(triton_poi_fused_index_select_0_xnumel),
        stream=stream0)
```
# Issue
In our `IndexPropagation.indirect_indexing()` call we have `expr=indirect0` which is spawned in `LoopBodyBlock.indirect_indexing()`.
3b555ba477/torch/_inductor/ir.py (L8154-L8160)
When we try to see if we can prove its bounds, we fail because `indirect0` isn't in `var_ranges`.
# Approach
When creating `indirect` symbols from the fallback, specify their range as `[-size, size - 1]` to avoid a lookup error with `indirectX`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128378
Approved by: https://github.com/lezcano, https://github.com/peterbell10
**Summary**
Currently, the comm_mode_feature_examples does not have an example for printing sharding information for a model with nested module. While adding the new example to the suite, I recognized a way to refactor existing examples in order to make them more readable for users. The expected output can be found below:
<img width="354" alt="Screenshot 2024-06-11 at 5 41 14 PM" src="https://github.com/pytorch/pytorch/assets/50644008/68cef7c7-cb1b-4e51-8b60-85123d96ca92">
**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128461
Approved by: https://github.com/XilunWu
ghstack dependencies: #128369, #128451
**Summary**
I have added comments to address previous readability concerns in comm_mode.py and comm_mode_features_example.py. I also renamed files and test cases in order to better reflect what they are about. Removed non-distributed test case and other lines of code that do not contribute to the example of how comm_mode can be used. Finally, I've added the expected output for each example function so users are not forced to run code.
**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128451
Approved by: https://github.com/XilunWu
ghstack dependencies: #128369
**Summary**
Currently, CommDebugMode only allows displaying collective tracing at a model level whereas a user may require a more detailed breakdown. In order to make this possible, I have changed the ModuleParamaterShardingTracker by adding a string variable to track the current sub-module as well as a dictionary keeping track of the depths of the submodules in the model tree. CommModeDebug class was changed by adding a new dictionary keeping track of the module collective counts as well as a function that displays the counts in a way that is easy for the user to read. Two examples using MLPModule and Transformer have been added to showcase the new changes. The expected output of the simpler MLPModule example is:
<img width="255" alt="Screenshot 2024-06-10 at 4 58 50 PM" src="https://github.com/pytorch/pytorch/assets/50644008/cf2161ef-2663-49c1-a8d5-9f97e96a1791">
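A rough single-process usage sketch (the import path and helper names below are assumptions based on this description, not a verified API surface):
```python
# Rough sketch only: with no DTensors involved, the recorded collective count is
# simply zero; the point is just the with-block usage pattern.
import torch
from torch.distributed._tensor.debug import CommDebugMode

model = torch.nn.Linear(4, 4)
comm_mode = CommDebugMode()
with comm_mode:
    model(torch.randn(2, 4))
print(comm_mode.get_total_counts())   # total number of collectives observed
```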
**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/display_sharding_example.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128369
Approved by: https://github.com/XilunWu
Summary:
Added `set_module_name_qconfig` support to allow users to set configurations based on module name in `X86InductorQuantizer`.
For example, only quantize the `sub`:
```python
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 5)
        self.sub = Sub()

    def forward(self, x):
        x = self.linear(x)
        x = self.sub(x)
        return x

m = M().eval()
example_inputs = (torch.randn(3, 5),)
# Set config for a specific submodule.
quantizer = X86InductorQuantizer()
quantizer.set_module_name_qconfig("sub", xiq.get_default_x86_inductor_quantization_config())
```
- Added `set_module_name_qconfig` to allow the user to set the configuration at the `module_name` level.
- Unified the annotation process to follow this order: `module_name_qconfig`, `operator_type_qconfig`, and `global_config`.
- Added `config_checker` to validate all user configurations and prevent mixing of static/dynamic or QAT/non-QAT configs.
- Moved `_get_module_name_filter` from `xnnpack_quantizer.py` into `utils.py` as it is common to all quantizers.
Test Plan
```bash
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_set_module_name
```
@Xia-Weiwen @leslie-fang-intel @jgong5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126044
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jerryzh168
When we don't call dynamo.reset(), we don't recompile on different dynamic shapes.
Also, some of the returned views were tuples, so when we `* 2`, we actually just copied all the inputs twice within the tuple. I changed it so that it only returns one of the values from the returned tuple.
Additionally, this exposes a bug that fails with the slice operation, so I skipped it when we're testing with dynamic shapes:
```
File "/home/dberard/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3996, in produce_guards
sexpr = ShapeGuardPrinter(symbol_to_source, source_ref, self.var_to_sources).doprint(expr)
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 292, in doprint
return self._str(self._print(expr))
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print
return printmethod(expr, **kwargs)
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 56, in _print_Add
t = self._print(term)
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print
return printmethod(expr, **kwargs)
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 366, in _print_Mul
a_str = [self.parenthesize(x, prec, strict=False) for x in a]
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 366, in <listcomp>
a_str = [self.parenthesize(x, prec, strict=False) for x in a]
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 37, in parenthesize
return self._print(item)
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print
return printmethod(expr, **kwargs)
File "/home/dberard/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1494, in _print_Symbol
assert self.symbol_to_source.get(expr), (
AssertionError: s3 (could be from ['<ephemeral: symint_visitor_fn>', '<ephemeral: symint_visitor_fn>']) not in {s0: ["L['x'].a.size()[1]", "L['x'].b.size()[1]", "L['x'].size()[1]", "L['x'].a.size()[1]", "L['x'].b.size()[1]", "L['x'].a.size()[1]", "L['x'].b.size()[1]"], s1: ["L['x'].a.stride()[0]", "L['x'].b.stride()[0]", "L['x'].stride()[0]", "L['x'].a.stride()[0]", "L['x'].b.stride()[0]", "L['x'].a.stride()[0]", "L['x'].b.stride()[0]"], s2: ["L['x'].a.storage_offset()", "L['x'].b.storage_offset()", "L['x'].a.storage_offset()", "L['x'].b.storage_offset()"]}. If this assert is failing, it could be due to the issue described in https://github.com/pytorch/pytorch/pull/90665
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128659
Approved by: https://github.com/YuqingJ
Summary:
GraphTransformObserver saves the SVG file of the input/output graph in each inductor pass. In my test with CMF model, if the graph is large, GraphViz took forever to convert DOT to SVG. That is NOT acceptable.
This DIFF saves a DOT file instead of an SVG file to speed it up. Also, the DOT file size is an order of magnitude smaller than SVG.
To view these graphs, the user can run `dot -Txxx input.dot` to convert DOT to any other format they want. The user can control how many iterations are used to lay out the graph properly. Refer to https://web.archive.org/web/20170507095019/http://graphviz.org/content/attrs#dnslimit for details.
Test Plan: buck2 test mode/dev-sand caffe2/test:fx -- fx.test_fx_xform_observer.TestGraphTransformObserver
Differential Revision: D58539182
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128634
Approved by: https://github.com/mengluy0125
It has come to my attention that some of our licenses are incorrect, so I attempted to rectify a few of them based on given recommendations for:
clog - BSD-3
eigen - MPL-2.0
ffnvcodec - LGPL-2.1
-> **hungarian - Permissive (free to use)**
irrlicht - The Irrlicht Engine License (zlib/libpng)
-> **pdcurses - Public Domain for core**
-> **sigslot - Public Domain**
test - BSD-3
Vulkan - Apache-2.0 or MIT
fb-only: more context is here https://fb.workplace.com/groups/osssupport/posts/26333256012962998/?comment_id=26333622989592967
This PR addresses the manual licensing mismatches mentioned above (the two bolded entries; one is being addressed in #128085). As everything else is generated by pulling from other files, I did not address those entries, and it is unclear what needs to be updated for the remaining ones to be accurate, or whether they are inaccurate today.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128630
Approved by: https://github.com/malfet
This matches our autograd logic for pytorch native operators. There's no
need to invoke an autograd.Function if we're under a torch.no_grad() or
if none of the inputs have requires_grad=True (invoking an
autograd.Function results in (noticeable) overhead).
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127976
Approved by: https://github.com/williamwen42
Fixes #127896
### Description
Add docstring to `torch/jit/frontend.py:get_default_args` function
### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128408
Approved by: https://github.com/malfet
In fused_all_gather_matmul, each rank copies their shard into their
local p2p buffer, performs a barrier, then performs (copy -> matmul) for
each remote shard. The (copy -> matmul)s for remote shards run on two
streams without synchronization. This not only allows for
computation/communication overlapping, but also computation/computation
overlapping which alleviates the wave quantization effect caused by
computation decomposition.
However, the synchronization-free approach doesn't work well with
fused_matmul_reduce_scatter, in which there's a barrier in every step.
Without synchronization between the two streams, a matmul in one stream
can delay a barrier in the other stream, further delaying the copy
waiting for the barrier.
This PR addresses the issue by adding synchronization between the two
streams such that the matmul of step i can only start after the barrier
of step i-1 completes. With this approach, we lose the
computation/computation overlapping, but avoid slowdown due to delayed
barrier.
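A schematic of the added ordering in plain CUDA stream/event terms (a sketch only; names and structure are illustrative, not the actual implementation):
```python
# Expresses "matmul of step i waits for the barrier of step i-1" with generic
# CUDA streams and events.
import torch

assert torch.cuda.is_available()
comm_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()
barrier_done = torch.cuda.Event()

with torch.cuda.stream(comm_stream):
    # ... barrier + copy for step i-1 would run here ...
    barrier_done.record(comm_stream)

with torch.cuda.stream(compute_stream):
    compute_stream.wait_event(barrier_done)   # step i's matmul starts only after the barrier
    # ... matmul for step i would run here ...
```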
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127455
Approved by: https://github.com/Chillee
ghstack dependencies: #127454
This PR changes the traced_tangents field of ViewAndMutationMeta to be cache-safe. Specifically, at runtime, the only time we need the fw_metadata's traced_tangents field is for Tensor subclass metadata from __tensor_flatten__. So instead of storing an entire FakeTensor, which has many fields that can be unserializable, we only store the result of __tensor_flatten__() on any FakeTensors representing subclasses.
That said, there's no guarantee that `__tensor_flatten__` is actually serializable: if we fail to pickle the result of __tensor_flatten__ we won't save to the cache.
To do this, we also make a small change to `__coerce_same_metadata_as_tangent__`, so that it takes in the return value of tensor_flatten() instead of an entire FakeTensor. Let me know if we should change the name of the function.
By doing this, we can now run the dynamic shapes cache test with autograd turned on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127618
Approved by: https://github.com/bdhirsh
Summary: I've seen this issue once in the wild and oulgen was able to repro in a unit test. The problem is this:
- We're using pickle to turn everything related to the FX graph cache key into a byte stream, then hashing the bytes to compute the cache key.
- Pickle is optimized to avoid serializing the same ID more than once; it instead drops a reference to a previously-pickled object if it encounters the same ID.
- That pickle behavior means that we can see different cache keys if an object id appears more than once in the hashed objects vs. being functionally equivalent but distinct objects.
The cases I've investigated only involve the torch.device objects in the tensor graph args. That is, we may compile a graph with two tensor args, each referencing `torch.device('cpu')`. In one run, those devices may reference the same object; in another, they may reference distinct (but equivalent) objects. In practice, my observation is that the compiler is largely deterministic and this situation is rare. I've seen cache misses on a real benchmark only when enabling/disabling FakeTensor caching in order to introduce different code paths that otherwise produce the same fx graph. But the failing unit test seems to be enough motivation for a remediation?
I don't really love this solution, but I've failed to find another way to make the pickling phase robust to these kinds of changes, e.g., by changing the protocol version or by overriding internal methods (which would also be gross). But I'm definitely open to other creative ideas.
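For concreteness, a small self-contained demonstration of the pickle behavior in question:
```python
# The same device object pickled twice produces different bytes than two distinct
# (but equal) device objects, because pickle memoizes by object id; a bytes-based
# hash therefore sees two different cache keys.
import pickle
import torch

d = torch.device("cpu")
shared = pickle.dumps((d, d))                                     # second element is a memo reference
distinct = pickle.dumps((torch.device("cpu"), torch.device("cpu")))
print(shared == distinct)   # False, even though the tuples compare equal
```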
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128366
Approved by: https://github.com/oulgen, https://github.com/eellison
Summary: The feature was previously disabled in fbcode due to breaking the deterministic NE unit tests. Now that it has been on in OSS for quite a while and we verified that it has no NE impact on CMF, we want to update the unit test and enable the feature.
Test Plan:
```
time buck2 test 'fbcode//mode/opt' fbcode//aps_models/ads/icvr/tests/ne/e2e_deterministic_tests:fm_tests -- --exact 'aps_models/ads/icvr/tests/ne/e2e_deterministic_tests:fm_tests - aps_models.ads.icvr.tests.ne.e2e_deterministic_tests.icvr_fm_test.ICVR_FM_DeterministicTest: test_icvr_fm_pt2_fsdp_multi_gpus'
```
Differential Revision: D58425432
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128555
Approved by: https://github.com/eellison
Summary: When calling a fallback op in the minimal_arrayref_interface mode with an optional tensor, a temporary RAIIAtenTensorHandle needs to be explicitly created in order to pass a pointer to the tensor as the optional tensor parameter.
Test Plan: CI
Differential Revision: D58528575
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128613
Approved by: https://github.com/hl475
When performing fused_all_gather_matmul/fused_matmul_reduce_scatter and gather_dim/scatter_dim != 0, a copy of the lhs operand (A_shard/A) is needed for layout transformation.
This copy can be avoided if the lhs operand already has the following stride order:
`lhs.movedim(gather_dim, 0).contiguous().movedim(0, gather_dim).stride()`
In `micro_pipeline_tp` passes, we enforce the lhs operand to have such stride order via `inductor_prims.force_stride_order`. This way if the lhs operand has a flexible layout, the copy is avoided.
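A tiny self-contained check of that stride order (shape and gather_dim here are arbitrary):
```python
# Compute the favorable stride order referenced above and compare it against a
# plain contiguous tensor's strides.
import torch

A = torch.randn(8, 16, 32)
gather_dim = 1
favorable = A.movedim(gather_dim, 0).contiguous().movedim(0, gather_dim).stride()
print(favorable)                 # the layout for which the extra copy is avoided
print(A.stride() == favorable)   # a plain contiguous tensor generally does not match it
```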
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127454
Approved by: https://github.com/Chillee
This PR introduces naive CPU impls for:
* `_jagged_to_padded_dense_forward()`
* `_padded_dense_to_jagged_forward()`
On the CUDA side, these are backed by lifted FBGEMM kernels. We may want to revisit the CPU versions with higher-performance implementations at a later time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127007
Approved by: https://github.com/davidberard98
We should be able to remove this as, with the new canonicalisation, we
have that `a < b` and `-a > -b` should be canonicalised to the same
expression (if SymPy does not interfere too much).
nb. I thought this would cut compilation time further, but I was running
the benchmarks wrong (not removing triton's cache, oops). It turns out that
after the first PR in this stack, https://github.com/pytorch/pytorch/issues/128398 is fully fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128500
Approved by: https://github.com/ezyang
ghstack dependencies: #128410, #128411
https://github.com/pytorch/pytorch/pull/126717 will skip the tests in both ABI compatible and non-ABI compatible mode.
It's not expected to skip them in non-ABI compatible mode since they can actually run successfully in such mode but only have issues in ABI compatible mode.
We leverage the existing `xfail_list` for those that will only fail in ABI compatible mode.
- `test_qlinear_add` is already in the `xfail_list`.
- `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506
Approved by: https://github.com/jgong5, https://github.com/desertfire
Fixes https://github.com/pytorch/pytorch/issues/128544
Fixes https://github.com/pytorch/pytorch/issues/128535
We had a problem with multithreading where the nonlocals were being
clobbered. In the first place, we stored these nonlocals because we
wanted to ferry information from an autograd.Function.apply to
autograd.Function.forward.
Our new approach is:
- pass the information directly as an input to the
autograd.Function.apply. This means that the autograd.Function.forward
will receive the information too.
- this messes up ctx.needs_input_grad, which has an element per input to
forward. The user should not see the additional information we passed.
We fix this by temporarily overriding ctx.needs_input_grad to the
right thing.
- this exposed a bug in that ctx.needs_input_grad wasn't correct for
TensorList inputs. This PR fixes that too.
Test Plan:
- existing and new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128547
Approved by: https://github.com/williamwen42, https://github.com/soulitzer
https://github.com/pytorch/pytorch/issues/127572
Allow mutations in backward on forward inputs, if:
1/ the mutation does not change metadata (enforced at compilation time);
2/ when create_graph=True, the mutated input does not require grad (enforced at runtime, where create_graph mode can be detected by checking torch.is_grad_enabled()).
Adds input_joint_info to track mutations of inputs during the joint function.
It is a separate field in ViewAndMutationMeta as it is filled only after joint fn tracing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128409
Approved by: https://github.com/bdhirsh
As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows:
1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling.
2. Support bf16/fp16 legalization for `codegen_loop_bodies` which is used to generate the epilogue loops.
3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular, for the reuses for output buffers of GEMM, template and epilogues. This is not correct since the output buffer is an "output" not an "in-place" buffer of the template kernel itself. Now, we use a dedicated "aliases" dict to manage such buffer reuses and the intermediate aliasing buffers are removed after codegen.
4. Add `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops to work on smaller-sized local buffers for better data locality.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126545
Approved by: https://github.com/jansel
Minor tweak of comparison as using `assert` on `torch.allclose` prevents the mismatches from being logged. Also bump a few tolerances that seem to be causing failures on sm86/sm90
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128553
Approved by: https://github.com/jcaip
## Context
This PR ported GGML int8 per channel matrix multiplication and matrix vector multiplication metal shaders into ATen library.
llama.cpp LICENSE: https://github.com/ggerganov/llama.cpp/blob/master/LICENSE
## Key Changes
Made the following changes to the original code:
* Memory layout of weight and scales is different than llama.cpp.
* Weight dequantization (scales multiplication) is done after MM is finished.
* Following PyTorch naming convention (M, K, N and assuming row major).
## Benchmark
When M = 1, mv shader improves existing ATen int8mm by 40%.
When M > 4, mm shader outperforms existing ATen int8mm up to 10x for a large M, as shown below.

Hence the kernel chooses different shaders based on M.
## Test Plan
Tests are passing:
```
❯ python test/test_mps.py -v -k _int8_
/Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'dlopen(/Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/image.so, 0x0006): Symbol not found: __ZN3c1017RegisterOperatorsD1Ev
Referenced from: <A770339A-37C9-36B2-84FE-4125FBE26FD6> /Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/image.so
Expected in: <5749F98A-0A0C-3F89-9CBF-277B3C8EA00A> /Users/larryliu/CLionProjects/pytorch/torch/lib/libtorch_cpu.dylib'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
test__int8_mm_m_1_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_1_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_1_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_1_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
----------------------------------------------------------------------
Ran 12 tests in 1.180s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127646
Approved by: https://github.com/malfet
In a previous life, we used sympy.oo to represent the lower/upper bounds of integer ranges. Later, we changed this to be sys.maxsize - 1 for a few reasons: (1) sometimes we do tests on a value being exactly sys.maxsize, and we wanted to avoid a data dependent guard in this case, (2) sympy.oo corresponds to floating point infinity, so you get incorrect types for value ranges with oo, and (3) you can do slightly better reasoning if you assume that input sizes fall within representable 64-bit integer range.
After working in the sys.maxsize regime for a bit, I've concluded that this was actually a bad idea. Specifically, the problem is that you end up with sys.maxsize in your upper bound, and then whenever you do any sort of size-increasing computation like size * 2, you end up with 2 * sys.maxsize, and you end up doing a ton of arbitrary precision int computation that is totally unnecessary. A symbolic bound is better.
But especially after #126905, we can't go back to using sympy.oo, because that advertises that it's not an integer, and now your ValueRanges is typed incorrectly. So what do we do? We define a new numeric constant `int_oo`, which is like `sympy.oo` but it advertises `is_integer`. **test/test_sympy_utils.py** describes some basic properties of the number, and **torch/utils/_sympy/numbers.py** has the actual implementation.
The rest of the changes of the PR are working out the implications of this change. I'll give more commentary as inline comments.
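For concreteness, a minimal sketch of the intended semantics (the import path follows torch/utils/_sympy/numbers.py mentioned above; the arithmetic shown is the expected behavior, not a verified spec):
```python
# int_oo behaves like an infinity that advertises integer-ness, so value ranges
# stay well-typed and bounds stop degenerating into huge concrete integers.
from torch.utils._sympy.numbers import int_oo

print(int_oo.is_integer)   # True: unlike sympy.oo, it advertises integer-ness
print(-int_oo < 0)         # behaves as an integer negative infinity
print(int_oo + 1)          # expected to stay int_oo, avoiding 2 * sys.maxsize style blow-ups
```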
Fixes https://github.com/pytorch/pytorch/issues/127396
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127693
Approved by: https://github.com/lezcano
ghstack dependencies: #126905
Summary: There are clang errors in profiler_kineto. It would probably be a good idea to fix them as the file is already quite dense.
Test Plan: Make sure all on Phabricator all tests under static_tests/lint_root pass
Differential Revision: D58431005
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128464
Approved by: https://github.com/aaronenyeshi
Related doc: https://docs.google.com/document/d/1BKyizkZPdri9mHqdDOLAUpkI7SbbKfLHRFVVpK9ZWqo/edit
Memory considerations:
- As with the existing SAC, cached values are cleared upon first use.
- We error if the user wishes to backward a second time on a region forwarded with SAC enabled.
In-place:
- We use version counting to detect whether any cached tensor has been mutated and error in that case. In-place operations not mutating cached tensors are allowed.
- `allow_cache_entry_mutation=True` can be passed to disable this check (useful in the case of auto AC where the user cleverly also saves the output of the in-place op)
Randomness, views
- Currently in this PR, we don't do anything special for randomness or views; the author of the policy function is expected to handle them properly. (Would it be beneficial to error? We either want to save all or recompute all random tensors.)
Tensor object preservation
- We guarantee that if a tensor does not require grad, and it is saved, then what you get out is the same tensor object. If the tensor does require grad, we must detach to avoid creating a reference cycle. This is a nice guarantee for nested tensors which care about the object identity of the offsets tensor.
Policy function
- Enum values are `{MUST,PREFER}_{SAVE,RECOMPUTE}` (bikeshed welcome). Alternatively there was `{SAVE,RECOMPUTE}_{NON_,}OVERRIDABLE`. The former was preferred bc it seemed clearer that two `MUST` clashing should error, versus it is ambiguous whether two `NON_OVERRIDABLE` being stacked should silently ignore or error.
- The usage of Enum today. There actually is NO API to stack SAC policies today. The only thing the Enum should matter for in the near term is the compiler. The stacking SAC policy would be useful if someone wants to implement something like simple FSDP, but it is not perfect because with a policy of `PREFER_SAVE` you are actually saving more than autograd would save normally (would be fixed with AC v3).
- The number of times we call the policy_fn is documented as part of the public API. We call the policy function for all ops except detach because detach is itself called a different number of times by AC between forward and recompute.
- The policy function can be a stateful object (we do NOT make separate copies of this object for forward/recompute, the user is expected to handle that via is_recompute see below).
Tensors guaranteed to be the same tensor as-is
- Policy function signature takes ctx object as its first argument. The ctx function is an object encapsulating info that may be useful to the user, it currently only holds "is_recompute". Adding this indirection gives us flexibility to add more attrs later if necessary.
"bc-breaking" for existing users of the private API:
- Existing policy functions must now change their return value to use the Enum.
- Existing calls to `_pt2_selective_checkpoint_context_fn_gen` must be renamed to `gen_selective_checkpoint_context_fn`. The way you use the API remains the same. It would've been nice to do something different (not make the user have to use functools.partial?), but this was the easiest to compile (idk if this should actually be a constraint).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125795
Approved by: https://github.com/Chillee, https://github.com/fmassa
Summary: prim::dtype has the signature `(Tensor a) -> int`, where it gets the dtype of the tensor and returns the integer corresponding to this dtype based on the enum in ScalarType.h. Previously we were converting prim::dtype by returning the actual dtype of the tensor (ex. torch.float32). This causes some incorrect control flow behavior, specifically where it checks if `prim::dtype(tensor) in [3, 5, 7]`, where [3, 5, 7] correspond to torch.int32, torch.float16, torch.float64. This control flow would always return False because we would be comparing torch.float32 against the integers [3, 5, 7], which is a type mismatch.
Test Plan: 7/22 internal models are now convertible and runnable in eager and sigmoid! P1410243909
Reviewed By: jiashenC
Differential Revision: D58469232
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128517
Approved by: https://github.com/jiashenC
```at::detail::computeStorageNbytesContiguous``` does fewer data-dependent tests compared to ```at::detail::computeStorageNbytes```. Therefore, use of the former is more likely to succeed with dynamic shapes. This PR detects is_contiguous and dispatches to the appropriate function. This should be helpful in unblocking aot_eager for torchrec. As an aside, this is an alternative solution to the unsound solution I had first proposed in another [PR](#128141).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128515
Approved by: https://github.com/ezyang
This PR lifts internal lowerings written for FBGEMM kernels that do jagged <-> padded dense conversions. In particular, this PR provides lowerings and meta registrations for the following ATen ops:
* `_jagged_to_padded_dense_forward()`
* `_padded_dense_to_jagged_forward()`
* NB: if `total_L` is not provided, the output shape is data-dependent. An unbacked SymInt is used for this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125968
Approved by: https://github.com/davidberard98
This PR intends to support the aten operations with the `out` tensor.
Currently, the AOT compile always does **NOT** keep input tensor mutations. According to the comments, this is because it has not encountered such a use case.
> For now there's no use case involving keeping input mutations in the graph (which we can only do in the inference case anyway). We can add this later if we need to.
However, for aten operations, it is common that the `out` tensor is an input parameter and needs to be mutated. This PR intends to support it by adding a `keep_inference_input_mutations` flag to `aot_inductor.keep_inference_input_mutations`. This flag gives the callee the flexibility to decide whether the AOT compile needs to keep input tensor mutations in the graph.
Take `clamp` as an example as follows.
```python
out_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(-2.0)
inp_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(1.0)
min_tensor = inp_tensor - 0.05
max_tensor = inp_tensor + 0.05
torch.clamp(input=inp_tensor, min=min_tensor, max=max_tensor, out=out_tensor)
```
W/O this PR
```python
def forward(self):
arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";
arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None
clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1); clamp_min = arg2_1 = None
return (clamp_max, clamp_max)
```
W/ this PR
```python
def forward(self):
arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";
arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None
clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1); clamp_min = arg2_1 = None
copy_: "f32[128]" = torch.ops.aten.copy_.default(arg3_1, clamp_max); arg3_1 = clamp_max = None
return (copy_,)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124926
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/angelayi
Summary:
D56907877 modified OSS commSplit. However, commSplit requires being called on every rank, even ranks with no color. ncclCommSplit will not create a communicator for no-color ranks, hence this line of code will potentially throw an error like `NCCL WARN CommUserRank : comm argument is NULL`
Revert this change from D56907877
Test Plan: CI
Differential Revision: D58436088
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128459
Approved by: https://github.com/shuqiangzhang
Summary:
The previous side effect pruning algorithm would keep many dead cell
variables alive. For example, in
https://github.com/pytorch/pytorch/issues/125078, the compiled function
has one return but there were three in the Dynamo graph due to two
dead cell variables not being pruned away.
This PR adds a corrected algorithm. "new cell variables" are alive if
they can be reached from one of the following:
1. any of the tx.symbolic_locals or tx.stack (that is, if they are
involved in a return from the function or intermediate variable
during a graph break). Example: an alive NestedUserFunctionVariable
2. "mutations to pre-existing objects". Example: appending a
NestedUserFunctionVariable to a global list
The new algorithm reflects this, but please let me know if there are
more cases to handle.
Test Plan:
- existing tests (afaict, test/dynamo/test_python_autograd is the best
SideEffects test case we have)
- see in test/dynamo/test_higher_order_ops that the expecttests changed
-- the functorch dynamo graphs no longer return dead cellvars.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128028
Approved by: https://github.com/jansel
This PR makes it so we lazily save to the cache on the backward call instead of always saving ahead of time. We have to pass a closure to post_compile to prevent cyclic dependencies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126999
Approved by: https://github.com/bdhirsh
ghstack dependencies: #126791
This is a short-term fix (for 2.4). In the longer term we should
fix https://github.com/pytorch/pytorch/issues/128430
The problem is that warnings.warn calls that are inside Dynamo print
all the time. Python warnings are supposed to print once, unless their
cache is reset: Dynamo ends up resetting that cache every time it runs.
As a workaround we provide our own warn_once cache that is keyed on the
warning msg. I am not worried about this increasing memory usage because
that's effectively what python's warnings.warn cache does.
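A minimal sketch of such a message-keyed cache (names are illustrative, not the actual helper added here):
```python
# Emit each warning message at most once per process, independent of the state of
# Python's own warnings filters.
import warnings

_warn_once_cache = set()

def warn_once(msg, category=UserWarning):
    if msg in _warn_once_cache:
        return
    _warn_once_cache.add(msg)
    warnings.warn(msg, category)

warn_once("deprecated path")
warn_once("deprecated path")   # suppressed by the cache; warnings.warn is never reached
```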
Test Plan:
- fix tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128456
Approved by: https://github.com/anijain2305
Fix https://github.com/pytorch/pytorch/issues/128287.
Previously the assertions in `linear_add_bias` were pretty fragile:
```
assert packed_weight_node.name == "_reorder_linear_weight"
assert transpose_weight_node.name == "permute_default"
```
because the `name` can be changed to `_reorder_linear_weight_id, permute_default_id` if we have more than 1 reorder/permute.
Checking `target` instead of `name` solves this issue (see the small illustration below).
The UT is also updated to match more than one `linear_add_bias` pattern to cover this case.
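A small self-contained illustration of why `target` is the stable thing to match on (the module here is made up):
```python
# FX de-duplicates node names with numeric suffixes, while node.target stays the
# same for every matching node.
import torch
import torch.fx as fx

class M(torch.nn.Module):
    def forward(self, x):
        return x.permute(1, 0).permute(1, 0)

graph = fx.symbolic_trace(M()).graph
perm_nodes = [n for n in graph.nodes if n.target == "permute"]
print([n.name for n in perm_nodes])     # e.g. ['permute', 'permute_1']
print({n.target for n in perm_nodes})   # {'permute'}
```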
Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128473
Approved by: https://github.com/jgong5
Summary:
A number of features rely on TCPStore as a control plane. By default the TCPStore server is started on the rank0 trainer, and this can create a race condition where rank0 exits (on error or graceful exit) and any other rank reading/writing will fail.
Solution: the TCPStore server should outlive all the trainer processes. Moving the ownership of the TCPStore to the torchelastic agent naturally fixes the lifecycle of the server.
Static rendezvous in torchelastic already supports sharing of the TCPStore server. We are extending this to the more commonly used c10d rendezvous handler.
Any handler that would like to manage the TCP store has to:
- Return true on `use_agent_store` property
- `RendezvousInfo`.`RendezvousStoreInfo`#[`master_addr/master_port`] values refer to managed TCPStore (those are returned on `next_rendezvous` call)
Note: in some instances users may want to use non-TCPStore based stores for the torchelastic rendezvous process, so the handler will need to create and hold a reference to TCPStore (as done in this change)
Test Plan:
`cat ~/workspace/dist-demo/stores.py`
~~~
import torch
import logging
import sys
import torch.distributed as dist
import torch
import os
import time
logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stderr))
logger.setLevel(logging.INFO)
def _run_test(store):
    if dist.get_rank() == 1:
        logger.info("Rank %s is sleeping", dist.get_rank())
        time.sleep(5)
        key = "lookup_key"
        logger.info("Checking key %s in store on rank %s", key, dist.get_rank())
        store.check([key])
    else:
        logger.info("rank %s done", dist.get_rank())

def main() -> None:
    use_gpu = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_gpu else "gloo")
    dist.barrier()
    logger.info(f"Hello World from rank {dist.get_rank()}")

    host = os.environ['MASTER_ADDR']
    port = os.environ['MASTER_PORT']
    world_size = os.environ['WORLD_SIZE']

    logger.info("testing TCPStore")
    store = dist.TCPStore(
        host_name=host, port=int(port), world_size=int(world_size),
    )
    _run_test(store)

if __name__ == "__main__":
    main()
~~~
With the fix (TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 or just drop the option)
~~~
(pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hello World from rank 1
Hello World from rank 2
Hello World from rank 0
testing TCPStore
testing TCPStore
testing TCPStore
rank 2 done
Rank 1 is sleeping
rank 0 done
Checking key lookup_key in store on rank 1
~~~
TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1
~~~
(pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hello World from rank 0
Hello World from rank 2
Hello World from rank 1
testing TCPStore
testing TCPStore
testing TCPStore
rank 0 done
rank 2 done
Rank 1 is sleeping
Checking key lookup_key in store on rank 1
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 46, in <module>
[rank1]: main()
[rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 42, in main
[rank1]: _run_test(store)
[rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 22, in _run_test
[rank1]: store.check([key])
[rank1]: torch.distributed.DistNetworkError: Connection reset by peer
E0605 17:40:22.853277 140249136719680 torch/distributed/elastic/multiprocessing/api.py:832] failed (exitcode: 1) local_rank: 1 (pid: 2279237) of binary: /home/kurman/.conda/envs/pytorch_38/bin/python
Traceback (most recent call last):
File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/users/kurman/pytorch/torch/distributed/run.py", line 904, in <module>
main()
File "/data/users/kurman/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/data/users/kurman/pytorch/torch/distributed/run.py", line 900, in main
run(args)
File "/data/users/kurman/pytorch/torch/distributed/run.py", line 891, in run
elastic_launch(
File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/kurman/workspace/dist-demo/stores.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-06-05_17:40:22
host : devgpu011.cln5.facebook.com
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2279237)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
~~~
Differential Revision: D58180193
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128096
Approved by: https://github.com/shuqiangzhang
With inlining NN modules these tests no longer raise runtime errors, because changing static ptrs induces a re-recording instead of a runtime error. The solution is to run the test with inlining disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128529
Approved by: https://github.com/anijain2305
ghstack dependencies: #128528
With inlining NN modules these tests no longer raise runtime errors, because changing static ptrs induces a re-recording instead of a runtime error. The solution is to run the test with inlining disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128528
Approved by: https://github.com/anijain2305
This PR implements "V0" of AOTAutogradCache. Given an input to AOTAutograd, we calculate a cache key, then save an AOTAutogradCacheEntry.
Each AOTAutogradCacheEntry has:
- A CompiledForward and optionally a CompiledBackward
- A bunch of metadata.
CompiledForward and CompiledBackward each save the *key* to the FXGraphCache associated with the compiled object. FXGraphCache populates this key field as long as it's able to return a compiled graph given a set of inputs. We then load the same object from the FXGraphCache on an AOTAutogradCache hit.
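A hedged sketch (field names are illustrative, not the real dataclasses) of the entry layout described above; the key point is that the entry stores FXGraphCache *keys* plus rewrapping metadata, not compiled artifacts:
```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class CompiledForward:
    fx_graph_cache_key: str  # key into FXGraphCache for the forward graph

@dataclass
class CompiledBackward:
    fx_graph_cache_key: str  # key into FXGraphCache for the backward graph

@dataclass
class AOTAutogradCacheEntry:
    compiled_fw: CompiledForward
    compiled_bw: Optional[CompiledBackward]
    runtime_metadata: Any  # wrappers/metadata reapplied to the callable on a cache hit
```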
On cache miss:
- Run AOTAutograd, up to AOTAutogradDispatch.post_compile.
- Save an AOTAutogradCacheEntry to the cache after compiling the necessary portions and receiving a cache key from FXGraphCache. In this PR we *always* compile the backward ahead of time. The PR above this one implements lazy backward caching, so that we only save to the cache after compiling the backward in a lazy backward scenario.
- Return the resulting object
On cache hit:
- Run AOTAutogradCacheEntry.post_compile() on the cache key.
- This attempts to load the forward and backward graphs from FXGraphCache
- As long as we successfully load from FXGraphCache, it's a hit. We then rewrap the callable with post compile wrappers using our saved metadata.
For now, we ignore the fakified out and debug wrappers. We only save to the cache if Fakified out is turned off.
V0 Guards behavior:
FXGraphCache serializes guards that are needed in the shape_env based on the symint inputs to the graph. The invariant that AOTAutograd uses here is that the sources for symints given to it by dynamo are exactly the same as the ones it passes to inductor, for both the forward and backward passes. (This does *not* mean that the tensor values passed in are the same: only that their symints are). That is, AOTAutograd and Inductor never create new guards based on symints with *different sources* than those passed in by dynamo.
We don't currently store any AOTAutograd specific guards: my hypothesis is that FXGraphCache already stores these, as any guards generated by AOTAutograd should already be in the shape_env before calling into inductor, and we don't generate new guards post inductor. If this is needed, I'll add it in another diff.
Testing:
We'll start with some basic unit tests, but I'll be adding more and more complicated testing as the next step.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126791
Approved by: https://github.com/bdhirsh
Before: `softmax` definition uses `jagged_unary_pointwise()` (wrong)
After: `softmax` impl adjusts the `dim` arg to account for the difference in dimensionality between the outer NT and the NT's `_values`
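A hedged illustration of that adjustment: for a jagged NT of logical shape (B, j1, D), `_values` has shape (total_length, D), so an outer `dim` past the ragged dim maps to `dim - 1` on the values buffer (treat this as pseudocode; the real impl handles more cases):
```python
import torch

def njt_softmax_sketch(nt, dim):
    # nt._values / nt._offsets are the jagged layout's internals mentioned above.
    out_values = nt._values.softmax(dim=dim - 1)
    return torch.nested.nested_tensor_from_jagged(out_values, nt._offsets)
```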
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119459
Approved by: https://github.com/soulitzer
Fixes #126950
`ptd_state_dict` with `broadcast_from_rank0=False` might miss two condition checks in `set_optimizer_state_dict`.
Here we add another condition for `full_state_dict=True`, with the corresponding tensor distribution done without broadcasting when `broadcast_from_rank0=False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128004
Approved by: https://github.com/fegin
Fixes #120570
## Description
Update torch.nanmean() docstring to mention input dtype requirement as either floating point type or complex.
Previously, the torch.mean() docstring had been updated in #120208 in a similar manner, but the torch.nanmean() docstring was not updated.
## Checklist
- [X] The issue that is being fixed is referred in the description.
- [X] Only one issue is addressed in this pull request.
- [x] Labels from the issue that this PR is fixing are added to this pull request.
- [X] No unnecessary issues are included into this pull request.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128155
Approved by: https://github.com/malfet
Thus far TunableOp was implemented for gemm, bgemm, and scaled_mm. gemm_and_bias was notably missing. This PR closes that gap.
This PR also fixes a regression after #124362 disabled the numerical check by default. The env var to enable it no longer worked.
CC @xw285cornell
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128143
Approved by: https://github.com/Skylion007
Not requiring all functions to have types allows a lot of 'Any' types to slip in - which poison types and make mypy unable to properly typecheck the code. I want to flip the default so that new files are required to have fully typed defs and we can have a burndown list of files that fail to require full types.
The preceding stack of PRs (cut up simply to keep the number of file changes per PR reasonable) adds `# mypy: allow-untyped-defs` to any file which didn't immediately pass mypy with the flag flipped. Due to changing files and merge conflicts it will probably be necessary to make several passes through before landing this final PR which turns the option on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127836
Approved by: https://github.com/oulgen, https://github.com/Skylion007
This PR implements "V0" of AOTAutogradCache. Given an input to AOTAutograd, we calculate a cache key, then save an AOTAutogradCacheEntry.
Each AOTAutogradCacheEntry has:
- A CompiledForward and optionally a CompiledBackward
- A bunch of metadata.
CompiledForward and CompiledBackward each save the *key* to the FXGraphCache associated with the compiled object. FXGraphCache populates this key field as long as it's able to return a compiled graph given a set of inputs. We then load the same object from the FXGraphCache on an AOTAutogradCache hit.
On cache miss:
- Run AOTAutograd, up to AOTAutogradDispatch.post_compile.
- Save an AOTAutogradCacheEntry to the cache after compiling the necessary portions and receiving a cache key from FXGraphCache. In this PR we *always* compile the backward ahead of time. The PR above this one implements lazy backward caching, so that we only save to the cache after compiling the backward in a lazy backward scenario.
- Return the resulting object
On cache hit:
- Run AOTAutogradCacheEntry.post_compile() on the cache key.
- This attempts to load the forward and backward graphs from FXGraphCache
- As long as we successfully load from FXGraphCache, it's a hit. We then rewrap the callable with post compile wrappers using our saved metadata.
For now, we ignore the fakified out and debug wrappers. We only save to the cache if Fakified out is turned off.
V0 Guards behavior:
FXGraphCache serializes guards that are needed in the shape_env based on the symint inputs to the graph. The invariant that AOTAutograd uses here is that the sources for symints given to it by dynamo are exactly the same as the ones it passes to inductor, for both the forward and backward passes. (This does *not* mean that the tensor values passed in are the same: only that their symints are). That is, AOTAutograd and Inductor never create new guards based on symints with *different sources* than those passed in by dynamo.
We don't currently store any AOTAutograd specific guards: my hypothesis is that FXGraphCache already stores these, as any guards generated by AOTAutograd should already be in the shape_env before calling into inductor, and we don't generate new guards post inductor. If this is needed, I'll add it in another diff.
Testing:
We'll start with some basic unit tests, but I'll be adding more and more complicated testing as the next step.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126791
Approved by: https://github.com/bdhirsh
Summary: I admit I'm not 100% sure what I'm doing here. I'm hitting a bug in the FX graph cache when we try to evaluate a guards expression. We're creating guards that look like this:
```
Ne(CeilToInt(FloatTrueDiv(ToFloat(8*L['t0']) - 4.0, 8.0))*CeilToInt(FloatTrueDiv(ToFloat(8*L['t1']) - 4.0, 8.0)), CeilToInt(FloatTrueDiv(ToFloat(8*L['t1']) - 4.0, 8.0))) and ...
```
It looks like we have a facility to define these operators in the SYMPY_INTERP map and we're just missing FloatTrueDiv and ToFloat. What's surprising to me is that we're only hitting this problem with the FX graph cache enabled. We can create such guards, but we've never actually evaluated any?
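A hedged sketch of the kind of addition implied here (the entries are illustrative; the actual map lives alongside the guard-evaluation code):
```python
# Runtime callables for the sympy ops that can appear in a serialized guard.
SYMPY_INTERP_ADDITIONS = {
    "ToFloat": float,
    "FloatTrueDiv": lambda a, b: float(a) / float(b),
}
```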
Test Plan:
`TORCHINDUCTOR_FX_GRAPH_CACHE=1 python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --only detectron2_fcos_r_50_fpn`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128418
Approved by: https://github.com/ezyang
Summary: Improve the fp32-to-fp16 conversion fx pass to use a to_dtype node and const folding instead of in-place conversion.
Test Plan:
```
buck2 test @//mode/{opt,inplace} //glow/fb/fx/fba/tests:test_fba_pass_manager_builder
```
Differential Revision: D57803843
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127829
Approved by: https://github.com/Skylion007
Fixes #127905
### Description
Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function
### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128082
Approved by: https://github.com/titaiwangms
Summary:
Pass parameters from the request to the dump_nccl_trace_pickle handler.
The supported parameter names and values are all lowercase:
includecollectives={true, false}
includestacktraces={true, false}
onlyactive={true, false}
An example POST is:
/handler/dump_nccl_trace_pickle?includecollectives=true&includestacktraces=false&onlyactive=true
Test Plan:
unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128307
Approved by: https://github.com/d4l3k
ghstack dependencies: #128191
Summary:
Add a unit test for the only_active flag of the _dump_nccl_trace API call.
With this flag, we expect only active records to be returned.
Test Plan:
Unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128191
Approved by: https://github.com/d4l3k
Fixes #127897
### Description
Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function
### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128171
Approved by: https://github.com/titaiwangms
Summary:
The previous side effect pruning algorithm would keep many dead cell
variables alive. For example, in
https://github.com/pytorch/pytorch/issues/125078, the compiled function
has one return but there were three in the Dynamo graph due to two
dead cell variables not being pruned away.
This PR adds a corrected algorithm. "new cell variables" are alive if
they can be reached from one of the following:
1. any of the tx.symbolic_locals or tx.stack (that is, if they are
involved in a return from the function or intermediate variable
during a graph break). Example: an alive NestedUserFunctionVariable
2. "mutations to pre-existing objects". Example: appending a
NestedUserFunctionVariable to a global list
The new algorithm reflects this, but please let me know if there are
more cases to handle.
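A rough sketch (not Dynamo's actual SideEffects code) of the liveness rule above: a newly created variable survives pruning only if it is reachable from the roots (symbolic_locals / stack) or from a mutation applied to a pre-existing object; `children_of` is a hypothetical helper returning the sub-variables a variable references:
```python
def prune_dead_side_effects(new_vars, roots, mutated_preexisting, children_of):
    alive, worklist = set(), list(roots) + list(mutated_preexisting)
    while worklist:
        var = worklist.pop()
        if id(var) in alive:
            continue
        alive.add(id(var))
        worklist.extend(children_of(var))  # hypothetical reachability helper
    return [v for v in new_vars if id(v) in alive]
```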
Test Plan:
- existing tests (afaict, test/dynamo/test_python_autograd is the best
SideEffects test case we have)
- see in test/dynamo/test_higher_order_ops that the expecttests changed
-- the functorch dynamo graphs no longer return dead cellvars.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128028
Approved by: https://github.com/jansel
We observed a significant compile time regression in torchtitan when turning
on 2D parallel + torch.compile recently, so I decided to get a deeper
understanding of why.
It turns out this affects **all trainings** that have functional collectives
captured in the graph, not only 2D parallel (2D parallel was just the
job that happened to have collectives captured in the TP region).
The root cause is that when doing inductor lowering, we call the comm analysis pass to get an estimated collective time for each collective node in the graph, and each such check calls `get_gpu_type()`, which under the hood calls `torch.utils.collect_env.run` to get the GPU info. However, this call is super expensive! It effectively spawns a new process and calls `nvidia-smi` to get the GPU info, so the cost is **linear** in the number of collective nodes in the graph.
See https://github.com/pytorch/pytorch/blob/main/torch/utils/collect_env.py#L75
The fix is to add an lru cache to the function, so that we only call it once and reuse the cached result afterwards.
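A minimal sketch of that fix, assuming a `get_gpu_type` helper that shells out via `torch.utils.collect_env.run` (the exact query string and fallback are illustrative):
```python
import functools

@functools.lru_cache(maxsize=None)
def get_gpu_type() -> str:
    # Memoized: `nvidia-smi` is spawned at most once per process instead of
    # once per collective node in the graph.
    from torch.utils.collect_env import run  # returns (return_code, stdout, stderr)
    rc, out, _ = run("nvidia-smi --query-gpu=name --format=csv,noheader")
    return out if rc == 0 else "unknown"
```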
torchtitan benchmark shows:
* before this fix: 2D parallel + fp8 compile time: 6min +
* after this fix: 2D parallel + fp8 compile time: 2min 48s (more than 100% improvement)
There is more room to improve the compile time, but this PR is trying to fix the biggest regression I found so far.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128363
Approved by: https://github.com/yf225
### Motivation
The Intel Gaudi accelerator (device name hpu) is seen to have a good pass rate with the PyTorch framework UTs; however, being an out-of-tree device, we face challenges in adapting the device to natively run the existing PyTorch UTs under pytorch/test. The UTs, however, are a good indicator of the device stack health, and as such we run them regularly with adaptations.
Although we can add the Gaudi/HPU device to generate the device-specific tests using the TORCH_TEST_DEVICES environment variable, we miss out on a lot of features such as executing for specific dtypes, skipping, and overriding OpInfos. With significant changes introduced every PyTorch release, maintaining these adaptations becomes difficult and time consuming.
Hence with this PR we introduce the Gaudi device in the common_device_type framework, so that the tests are instantiated for Gaudi when the library is loaded.
The eventual goal is to make Gaudi out-of-tree support equivalent to that of in-tree devices.
### Changes
Add HPUTestBase, a subclass of DeviceTypeTestBase, specifying appropriate attributes for Gaudi/HPU.
Include code to check whether the Intel Gaudi software library is loaded and, if so, add the device to the list of devices considered for instantiation of device-type tests.
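A hedged sketch of that hook (the exact names in the PR may differ, and the Gaudi module name is assumed):
```python
from torch.testing._internal.common_device_type import DeviceTypeTestBase

class HPUTestBase(DeviceTypeTestBase):
    device_type = "hpu"

def maybe_register_hpu(device_type_test_bases: list) -> None:
    try:
        import habana_frameworks.torch  # noqa: F401  # assumed Gaudi library module
    except ImportError:
        return  # Gaudi stack not present; do not instantiate hpu tests
    device_type_test_bases.append(HPUTestBase)
```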
### Additional Context
Please refer to the following RFC: https://github.com/pytorch/rfcs/pull/63/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126970
Approved by: https://github.com/albanD
gen_static_runtime_ops hasn't been updated in a while. In preparation for https://github.com/pytorch/pytorch/pull/127675 in which I need to re-run the codegen step for cumprod, I want to land these changes beforehand in case there are any other issues that arise.
I added a number of ops to the blocklist:
```
+ "_nested_tensor_storage_offsets",
+ "_nested_get_values", # no CPU backend
+ "_nested_get_values_copy", # no CPU backend
+ "_nested_view_from_jagged", # testing needs to be patched
+ "_nested_view_from_jagged_copy", # testing needs to be patched
+ "_nested_view_from_buffer", # testing needs to be patched
+ "_nested_view_from_buffer_copy", # testing needs to be patched
+ "_int_mm", # testing needs to be patched
+ "_to_sparse_csc", # testing needs to be patched
+ "_to_sparse_csr", # testing needs to be patched
+ "segment_reduce", # testing needs to be patched
```
Most of these are added just because testing doesn't work right now.
Additionally, a few `fft` ops seem to have been removed from native_functions.yaml; I'm guessing it's unlikely FFT would have been used in many real models though.
Differential Revision: [D58329403](https://our.internmc.facebook.com/intern/diff/D58329403/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128299
Approved by: https://github.com/YuqingJ
Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed and split into NCCLUtils files.
In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969
Approved by: https://github.com/c-p-i-o
This PR properly registers the tensor used in the module compute as a parameter. This bug was hidden previously because all tensors on the nn modules would be considered constant by dynamo, with inlining NN modules, this is no longer the case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128356
Approved by: https://github.com/anijain2305
ghstack dependencies: #128355
Today inlining builtin nn modules is not compatible with parameter freezing. Freezing parameters and then constant folding them through the graph relies on the assumption that they will not be inputs and will be static across calls to the same graph. When inlining builtin nn modules this assumption is broken and we reuse the same graph for different instances of the same nn module. There are three options: 1) abandon constant folding, 2) create a dispatcher layer (like cudagraphs) which dispatches to the correct constant-folded graph for each distinct set of parameters, or 3) recompile.
This PR implements option 3 by introducing guards on the parameter pointers, since freezing is relatively rare and performance sensitive. Option 2 had many more unknowns, and option 1 is not viable due to the drop in performance.
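A hedged illustration of option 3 (this is not the actual guard machinery): guard on the parameters' storage pointers so that a different module instance, with different parameter storages, triggers a recompile instead of reusing a graph whose weights were constant-folded for another instance:
```python
def make_param_pointer_guard(params):
    expected = [p.untyped_storage().data_ptr() for p in params]

    def check(current_params) -> bool:
        # Fails (forcing a recompile) whenever the underlying storages differ.
        return [p.untyped_storage().data_ptr() for p in current_params] == expected

    return check
```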
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128355
Approved by: https://github.com/anijain2305
We implemented a lowering for the avg_pool3d_backward operation and created tests for it.
We ran some benchmarks and achieved the following results:
```
[-------------- avgpool_3d_backwards --------------]
| Decomposed | Eager
16 threads: ----------------------------------------
(3, 5, 400, 200, 200) | 6061 | 11160
(3, 5, 300, 200, 200) | 4547 | 8372
(3, 5, 200, 200, 200) | 3032 | 5585
(3, 5, 300, 300, 300) | 10100 | 18840
(3, 5, 100, 100, 100) | 381 | 703
(3, 5, 100, 300, 200) | 2270 | 4190
(8, 8, 128, 128, 128) | 3397 | 6253
(2, 3, 150, 150, 150) | 520 | 947
(1, 3, 128, 128, 128) | 161 | 299
(8, 16, 64, 64, 64) | 851 | 1569
(1, 1, 50, 50, 50) | 17 | 11
(3, 5, 20, 40, 40) | 17 | 30
(3, 5, 10, 20, 20) | 17 | 11
(1, 1, 10, 10, 10) | 16 | 11
(3, 5, 5, 10, 10) | 17 | 11
(3, 5, 2, 5, 5) | 17 | 11
```
These were run on an RTX 3050, so we were not able to allocate larger tensors due to memory limitations.
We believe it would be beneficial to benchmark this on more recent hardware, just to check if the performance holds up with larger sizes.
Furthermore, we also refactored code from adaptive_avg_pool2d and adaptive_max_pool2d, to reduce code duplication.
We diffed the kernels and they are identical.
Fixes #127101
Co-authored-by: Martim Mendes <martimccmendes@tecnico.ulisboa.pt>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127722
Approved by: https://github.com/jansel
where ``$BUILD_ENVIRONMENT`` is one of the build environments
enumerated in
[pytorch-dockerfiles](https://github.com/pytorch/pytorch/blob/master/.ci/docker/build.sh). The dockerfile used by jenkins can be found under the `.ci` [directory](https://github.com/pytorch/pytorch/blob/master/.ci/docker)
2. Run ``docker run -it -u jenkins $DOCKER_IMAGE``, clone PyTorch and
run one of the scripts in this directory.
The Docker images are designed so that any "reasonable" build commands
will work; if you look in [build.sh](build.sh) you will see that it is a
very simple script. This is intentional. Idiomatic build instructions
should work inside all of our Docker images. You can tweak the commands
however you need (e.g., in case you want to rebuild with DEBUG, or rerun
the build with higher verbosity, etc.).
We have to do some work to make this so. Here is a summary of the
mechanisms we use:
- We install binaries to directories like `/usr/local/bin` which
are automatically part of your PATH.
- We add entries to the PATH using Docker ENV variables (so
they apply when you enter Docker) and `/etc/environment` (so they
continue to apply even if you sudo), instead of modifying
`PATH` in our build scripts.
- We use `/etc/ld.so.conf.d` to register directories containing
shared libraries, instead of modifying `LD_LIBRARY_PATH` in our
build scripts.
- We reroute well known paths like `/usr/bin/gcc` to alternate
implementations with `update-alternatives`, instead of setting